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ABSTRACT 


In  this  Phase  I  study  funded  under  the  Small  Business  Innovation  Research  (SBIR)  program, 
statistical  methods  are  developed  using  the  predictive  inference  and  entropy  approach.  Previous 
recent  research  has  derived  entropy  as  the  natural  measure  of  model  apprcnrimation  error  from 
the  fundamental  statistical  principles  of  sufficiency  and  repeated  «w»pifag  in  this  study,  the 
areas  of  nonnested  multiple  comparison,  multivariable  rim*  series  analysis,  adaptive  ti«w«  series 
analysis  of  changing  procesres,  and  optimal  small  sample  inference  are  investigated.  Constrained 
maaimum  likelihood  methods  are  developed  for  general  nonnested  multiple  comparison.  For  the 
asymptotic  optimality  of  these  methods,  a  condition  on  the  Fisher  information  and  Hesnan 
matrices  must  be  satisfied.  Applying  these  results  to  multivariate  time  series  analysis,  lower 
bounds  ate  derived  for  the  achievable  accuracy  of  the  — transfer  function  and  spectral 
matrices,  Markov  and  canonical  variate  analysis  (CVA)  provide  a  ■— —  of  numerically  and  sta¬ 
tistically  stable  model  fitting  of  multivariable  time  series,  and  methods  provide  a  basis  for 
modeling  and  fitting  time  varying  models  of  changing  prtxcsses.;J4ethods  are  derived  for  the 
optimal  selection  of  data  length  for  fitting  slowly  changing  ptocesarn  well  as  for  optimal  selec¬ 
tion  of  the  dam  interval  for  detection  of  abrupt  changes.  Optimal  smakmmpie  methods  for  mul¬ 
tivariate  analysis  are  studied,  and  entropy  methods  are  shown  to  provide  significant  improvements 
in  very  small  samples.  Recommendations  for  Phase  0  research  and  development  focus  on  the 
adaptive  and  nonadaptive  time  series  analysis  procedures  developed  in  this  study. 
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L  INTRODUCTION  AND  OVERVIEW 


Id  this  study,  statistical  methods  are  developed  using  predictive  inference  and  entropy.  This 
approach  to  statistical  inference  allows  the  treatment  of  several  difficult  statistical  problems  that 
are  not  easily  dealt  with  using  traditional  statistical  methods.  The  particular  statistical  problems 

addressed  are 

•  statistical  model  building  involving  the  determination  of  parametric 
model  structure  and  order  in  the  general  case  of  multiple  nonnested 
alternatives, 

•  time  series  modeling  and  forecasting  involving  the  determination  of 
parametric  model  structure  and  order, 

a  adaptive  time  series  analysis  involving  optimal  methods  for  tracking 
slow  changes  as  well  as  for  detecting  abrupt  changes  or  failures, 

a  small  sample  inference  for  multivariate  distributions  of  the  exponen¬ 
tial  family. 

A  number  of  isnies  in  these  topics  are  resolved  naturally  in  the  predictive  inference  and  entropy 
setting.  This  report  provides  an  overview  of  the  progress  of  the  Phase  I  research  with  detailed 
papers  included  in  the  Appendices. 

The  recent  interest  in  predictive  distributions  has  come  from  several  directions.  Modem 
developments  apparently  begin  with  Jeffreys  (1961,  p.143)  using  a  Bayesian  approach  as  has  much 
of  the  work  following  (Aitchison  and  Dunsmore,  1975,  preface  and  pJ9).  The  frequentist 
viewpoint  taken  in  this  proposal  has  been  stimulation  by  small  sample  problems  (Murray,  1977, 
1979),  model  order  and  structure  determination  problems  involving  parametric  models  (Akaike, 
1973,  1974,),  and  nonnested  multiple  comparison  problems  (Larimore,  1977a,  1977b).  Classical 
methods  are  conceptually  ill-suited  or  perform  poorly  in  practice  on  such  problems. 

In  the  first  Chapter,  the  approach  using  predictive  inference  and  entropy  is  described.  The 
basis  of  this  approach  is  tbs  derivation  of  entropy  from  the  fundamental  statistical  principles  of 
sufficiency  and  repeated  —  mpling  in  the  context  of  the  predictive  inference  setup  as  first 
presented  in  Larimore  (1963a).  This  provides  a  sound  theoretical  foundation  that  was  previously 
lacking  for  the  use  of  entropy  as  the  natural  measure  of  the  error  in  approximating  a  true  future 
density  by  an  predictive  density  based  upon  a  present  sample.  The  generality  of  this 

entropy  measure  allows  comparison  of  statistical  inference  methods  and  the  derivation  of  more 
optimal  inference  procedures  including 
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•  general  inference  methods  such  u  parametric  or  noopanmetric 

methods 

•  exact  evaluation  of  small  sample  procedures 

•  determination  of  model  order  or  structure  including  the  case  of  non* 
nested  multiple  comparison 

•  Him  series  analysis  including  definition  of  optimal  tracking  of  time 
vaiying  processes  and  optimal  detection  of  abrupt  changes. 

The  generality  of  the  predictive  inference  and  entropy  approach  provides  a  basis  for  the  generali¬ 
zation  of  the  present  statistical  and  predictive  inference  methods  to  more  general  statistical  prob¬ 
lems. 

In  Chapter  2,  the  multiple  comparison  of  nonnested  constrained  models  is  developed.  Previ¬ 
ous  developments  in  multiple  comparison  have  considered  largely  the  nested  case  and  assume  that 
the  true  model  is  contained  in  one  of  the  models.  In  the  present  study,  the  case  of  constrained 
«n«riminn  likelihood  estimation  is  considered  where  the  true  model  may  not  be  contained  in  any 
of  the  hypothesized  models.  The  entropy  measure  provides  a  measure  that  allows  the  multiple 
comparison  problem  to  be  viewed  as  a  model  approximation  problem.  In  this  more  general  con¬ 
text  the  AIC  procedure  and  generalizations  of  it  are  found  to  give  asymptotically  optimal  predic¬ 
tive  inference  procedures  as  measured  by  entropy.  In  the  nested  case,  these  procedures  reduce  to 
the  generalized  likelihood  ratio  (GLR)  test  where  the  probability  of  rejection  is  a  function  of  the 
number  of  additional  parameters  in  the  alternative  model  not  contained  in  the  null  hypothesis. 

Time  series  analysis  for  stationary  processes  are  considered  in  Chapter  4.  The  entropy  meas¬ 
ure  provides  a  direct  interpretation  of  the  achievable  accuracy  in  estimation  of  the  power  spec¬ 
trum  of  a  process.  The  entropy  is  expressed  as  a  squared  relative  error  in  estimating  the  spec¬ 
trum.  A  generalization  of  this  to  multiple  time  series  relates  to  principle  components  of  the  pro¬ 
cess  cross  spectral  matrix.  A  lower  bound  is  determined  such  that  the  expected  integral  of  the 
squared  relative  error  in  spectral  estimation  is  bounded  by  the  number  of  estimated  parameters 
divided  by  twice  the  sample  tize.  An  example  of  spectral  estimation  of  an  ARMA(43)  process 
using  spectral  smoothing,  Autoregressive  modeling,  and  ARMA  modeling  shows  the  relative  error 
in  these  estimation  methods  as  dependent  on  the  number  of  estimated  parameters. 

The  topics  of  Markovian  time  series  or  state  space  models  provides  an  approach  to  time 
series  analysis  that  is  readily  computable  and  is  easily  extended  to  the  case  of  changing  processes. 
The  general  state  space  model  form  is  developed  for  Markov  processes.  The  canonical  variate 
analysis  (CVA)  method  gives  a  direct  and  numerically  stable  computational  method  for  determin¬ 
ing  state  space  models  from  observational  data.  The  basic  computational  method  is  the  general¬ 
ized  singular  value  decompoaitioo  (SVD).  This  method  allows  for  the  direct  determination  of  the 
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ryrimai  model  state  order  without  the  computationally  intensive  fitting  of  such  models  for  the 
evaluation  of  model  fit.  Once  the  model  state  order  is  determined,  the  state  space  model  coeffi¬ 
cients  are  empty  computed  by  regression.  This  method  generalizes  easily  to  changing  processes. 

Adaptive  time  series  methods  are  developed  in  Chapter  S.  Primarily  two  types  of  changes 
are  considered,  slowly  varying  changes  and  abrupt  changes  such  as  faults.  Time  varying  Markov 
processes  are  developed  for  such  changing  processes.  Such  processes  provide  the  hypothesized 
models  for  developing  optimal  tracking  and  detection  of  abrupt  changes.  An  A1C  based  pro¬ 
cedure  is  derived  for  the  near  optimal  selection  of  the  data  length  to  use  in  model  fitting.  An 
example  is  given  of  estimating  the  spectrum  of  a  time  varying  processes  that  gives  results  near  the 
best  previous  solutions  that  are  much  more  specialized.  For  abrupt  change  detection,  a  generali¬ 
zation  of  the  A1C  procedure  is  required  since  the  comparison  of  models  fitted  on  different  data 
intervals  is  required  which  is  not  considered  in  the  AIC  formulation.  Application  of  these 
methods  to  simulated  data  of  abrupt  changes  in  an  ARMA(4,3)  processes  including  jumps  in  the 
state,  changes  in  the  dynamics,  and  change  in  the  variance  of  the  excitation  noise  processes, 
demonstrates  that  the  procedure  is  sensitive  to  the  detection  of  these  very  different  types  of 
abrupt  changes. 

Small  sample  multivariate  inference  procedures  are  described  in  Chapter  6.  Since  the 
entropy  measure  gives  an  exact  rational  measure  of  the  relative  error  of  statistical  inference  pro¬ 
cedures  in  «"»*!  samples,  it  provides  the  bases  for  evaluation  and  development  of  small  sample 
inference  methods.  The  historical  approach  to  predictive  inference  involves  the  derivation  of  a 
Bayesian  predictive  density.  Although  the  method  is  Bayesian,  in  certain  instances,  the  resultant 
predictive  density  has  certain  invariance  properties  which  lead  to  an  optimal  predictive  density  in 
terms  of  the  entropy  measure.  Another  approach  involves  the  direct  solution  for  the  optimal 
invariant  predictive  density  minimizing  the  entropy  measure.  This  optimal  invariant  procedure 
leads  to  the  same  predictive  density  as  the  Bayesian  predictive  density  using  a  noninformative 
prior.  These  methods  are  compared  for  the  multivariate  normal  distribution  with  the  estimative 
and  best  normal  estimation  procedures. 


2.  APPROACH  USING  PREDICTIVE  INFERENCE  AND  ENTROPY 


The  coocepcs  of  prediction  and  inference  baaed  on  a  aet  of  data  are  very  old  and  underlie 
I  much  of  the  acientific  method.  While  the  scientific  method  haa  been  much  discussed  in  philo¬ 

sophical  and  qualitative  terms,  there  has  been  very  little  in  the  literature  from  a  basic  statistical 
viewpoint.  The  most  extensive  literature  appears  to  be  that  aaaociated  with  predictive  densities  or 
predictive  distributions  (see  Aitchiaon  and  Dunsmore,  1975).  That  approach  is  to  a  large  degree 
Bayes inn,  although  more  recent  treatments  have  developed  a  purely  frequency  sampling  interpre- 
!  tation  in  connection  with  use  of  entropy  or  Kullback  information.  The  weak  point  in  the  fre¬ 

quency  approach  was  the  seemingly  arbitrary  use  of  the  entropy  measure  of  model  approximation 
error.  More  recently,  however,  the  result  of  Larimore  (1983a)  has  established  the  fundamental 
nature  of  the  entropy  measure  based  upon  the  statistical  principles  of  sufficiency  and  repeated 
m mpling  The  entropy  measure  has  in  addition  a  very  natural  interpretation  as  the  log  relative 
odds  in  comparing  two  predictive  densities  in  predicting  the  future  sample.  This  gives  a  central 
role  to  the  entropy  measure.  The  use  of  the  entropy  measure  for  decisions  on  model  order  and 
structure  was  pioneered  by  Akaike  (1973),  and  has  been  applied  to  many  diverse  statistical  prob- 
jj  lems  particularly  in  time  series  analysis.  The  justification  given  to  the  entropy  measure  in  the 

|  Akaike  approach,  however,  has  been  largely  heuristic.  Because  of  the  importance  of  the  justifi- 

\  cation  of  the  entropy  measure,  the  derivation  and  important  concepts  are  outlined  below  in  Sec- 

I  tion  2.1.  The  use  of  entropy  in  comparing  model  structure  selection  procedures  and  for  exact 

small  sample  inference  is  discussed  in  Section  22.  This  approach  to  developing  statistical  pro¬ 
cedures  using  predictive  inference  and  entropy  is  then  applied  to  the  various  topics  in  the  follow¬ 
ing  chapters  of  the  report. 

|  2JI  Dwtvatiea  of  the  Entropy  Measure  of  Approximation 

Predictive  inference  involves  an  experimental  situation  with  two  trials,  an  informative  trial 
with  observations  x  and  a  predictive  trial  with  observations  y.  The  joint  distribution  of  the  two 
trials  is  permitted  any  statistical  dependence  and  is  described  by  a  joint  probability  density  p(x  j). 
The  objective  is  to  choose  a  predictive  distribution  or  density  which,  for  each  posable  observed  x , 
is  a  probability  density  for  the  future  outcome  y.  More  precisely,  consider  a  family  p(y\  x,a)  ot 
predictive  densities  where  the  index  a  specifies  a  particular  predictive  distribution.  For  a  particu¬ 
lar  choice  of  a,  p(y\  x,a)  can  be  viewed  as  a  conditional  probability  density  of  y  given  x  for 
predicting  the  distribution  of  the  future  observation  y  given  an  observed  value  of  x.  The  predic¬ 
tive  inference  problem  involves  selection  of  a  criterion  of  fit  for  appraising  the  goodness  of 
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approximation  to  the  true  conditional  probability  density  p(y\  x)  by  the  various  predictive  densi¬ 
ties  p(y j  x,a)  specified  by  different  a.  The  choice  of  such  a  criterion  of  fit  is  the  primary  topic  of 
this  section.  Negative  entropy  is  derived  as  the  natural  measure  of  model  approximation  error  for 
any  predictive  distribution. 

Consider  a  family  C  -  {pa(yjx),  a €4}  of  predictive  densities  for  approximating  the  true 
density  p.(y\  x)  of  the  predictive  experiment  y  given  the  informative  experiment  x ,  where  x  and  y 
are  vectors  of  dimension  K  and  L  respectively  with  true  joint  density  p.(xj).  For  the  predictive 
inference  problem,  a  relative  measure  of  goodness  of  approximation  of  p.(y|  x)  by  the  various 
pjy |  x)  is  desired.  To  this  end,  a  repeated  sampling  experiment  is  considered  in  which  joint  ran¬ 
dom  samples  (x,j,)  for  are  drawn  repeatedly  from  a  population  with  density  p.  (*o0- 

The  probability  density  of  the  joint  predictive  experiments  Y  ~(yl . yN)  predicted  by  the  a-th 

model  using  X  *  (xj . xM)  is 

PaO'lJQ-nP.foU)  (2-1) 


The  probability  density  for  Y  can  be  considered  as  indexed  by  the  pair  (a,X)-  Statistical  infer¬ 
ence  is  considered  about  the  true  density  p,(Y |  X)  of  Y  from  among  the  family  of  probability  den¬ 
sities  F  -  { pa(X|  X),  a €4}  for  a  fixed  X . 

To  consider  the  essential  statistical  information  about  the  future  sample  Y  given  by  the 
predictive  densities  pa(F|X),  the  sufficiency  of  the  likelihood  function  (Zacks,  1971,  p.  61)  is 
used.  From  this  principle,  any  inferences  about  the  family  F  drawn  on  the  basis  of  the  sample 
(T|  X)  follow  from  the  observed  values  of  the  likelihood  function  pJiY\ X)  tot  a  (A.  The  set  of 
likelihood  ratios  formed  from  pairs  of  these  likelihoods  is  also  a  sufficient  statistic  (Cox  and  Hink- 
ley,  1974,  p.  20-1,  see  also  p.  37-9  for  a  discusrion  of  likelihood  and  sufficiency  principles). 

For  inference  about  the  densities  px  and  p2,  *11  of  the  information  is  contained  in  the  likeli¬ 
hood  ratio 


.  *  P\<yt\  *<) 

N  UiPafoU) 


(2-2) 


which  has  the  intuitive  interpretation  of  the  relative  odds  of  observing  the  data  Y  of  the  repeated 
predictive  trials  from  each  of  the  distributions  px  and  p2  Pven  fixed  informative  data  X.  The 
behavior  of  A*  as  the  number  of  repetitions  becomes  large  is  most  easily  seen  by  expressing  it  as 

tl  1  ,  .  ,  „  ..  1  !L  .  Pi(yi\*i) 

ta  7  -  hm  -X 
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-  ffp*(y\  *)  lo*p^j'x)  dy  p'(x)  **  (2-3) 

For  a  Urge  number  of  repetitions,  the  odds  will  overwhelmingly  favor  px  or  p2  if  the  limit  is 
respectively  strictly  positive  of  strictly  negative.  The  preference  for  one  distribution  over  the 
other  as  expressed  by  the  likelihood  ratio  tends  to  grow  exponentially  with  the  number  N  of 
repeated  trials.  If  (2-3)  is  zero,  then  there  will  be  no  consistent  preference  with  large  numbers  of 
trials.  Although  for  a  finite  number  N  of  repetitions  the  likelihood  ratio  Aw  depends  upon  the 
particular  samples  (X,Y),  asymptotically  for  large  numbers  of  repetitions  this  dependence  disap¬ 
pears. 

The  direct  pairwise  comparison  of  predictive  densities  is  not  necessary  if  the  Kullback- 
LeiWer  conditional  discrimination  information  (Kullback  and  Leibler,  1931;  Kullback,  1939,  p.  13) 

!j\ x(p* <pJ  m  fp>(y\  -0  lpgp*^)  dy  (2^) 

of  pa  relative  to  p,  is  used  which  is  a  function  of  x.  Note  that  the  order  of  pa  and  p.  are  not 
interchangeable  with  the  latter  playing  the  role  of  the  truth. 

The  likelihood  ratio  (2-3)  is  expressed  in  terms  of  the  Kullback  information  as 

Jj®  jf  lo*  A*  -  £,{/,,  t(p.  - 1^  ,(p.fid }  (2-5) 

where  E„  denotes  expectation  with  respect  to  the  true  density  p,(x).  The  criterion  is  thus  deter¬ 
mined  as  the  negative  entropy ,  or  negentropy  for  brevity,  defined  as 

-  fp»(*)dx  fp.(y\  x)  log -  ^  dy  (2-6) 

the  expected  Kullback  conditional  information  of  the  predictive  density  relative  to  the  true  condi¬ 
tional  density  p,(y ]x).  In  the  repeated  sampling  experiment,  the  predictive  density  with  the 
smaller  negentropy  relative  to  the  true  is  ultimately  preferred.  The  negentropy  (2-6)  thus  orders 
the  goodness  of  a  set  of  predictive  denstiea  in  approximating  the  true  density.  Also  in  comparing 
any  two  predictive  densities  px  and  p2,  the  respective  difference  has  the  intuitive  interpretation  as 
the  exponential  rate  at  which  the  likelihood  ratio  diverges. 

The  above  derivation  of  the  entropy  measure  of  approximation  of  a  predictive  density  uses 
only  the  predictive  inference  setup  in  the  repeated  sampling  context  along  with  the  principle  of 
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sufficiency.  The  sufficiency  principle  is  one  of  the  few  generally  accepted  principles  in  statistical 
inference.  Various  repeated  sampling  principles  have  been  formulated,  however  the  difficulty  has 
been  the  choice  of  an  evaluation  criterion  for  comparing  various  sampling  distributions.  The 
entropy  measure  gives  a  criterion  that  is  based  upon  basic  statistical  principles  of  inference. 

22  Use  Use  of  Entropy  in  Statistical  Inference 

The  entropy  measure  of  error  in  approximating  a  predictive  density  is  very  general  and  can 
be  applied  in  diverse  modeling  problems.  In  this  section,  some  of  the  general  model  selection 
problems  are  described  which  include  the  nonnested  multiple  comparison  problem,  adaptive  time 
series  analysis  of  changing  processes,  and  optimal  small  multivariate  methods. 

From  the  derivation  of  the  entropy  measure,  it  can  be  seen  that  the  entropy  measure  has  a 
number  of  .very  attractive  features: 

•  It  applies  to  completely  general  modeling  problems  including  non- 
parametric  methods. 

•  It  applies  exactly  to  small  samples. 

•  Only  the  fundamental  statistical  principles  of  sufficiency  and 
repeated  sampling  are  used. 

•  It  applies  to  time  correlated  problems  such  as  time  series  model 
identification  and  tracking. 

•  Statistical  inference  can  be  fundamentally  viewed  as  model  approxi¬ 
mation. 

Note  also  that  the  predictive  distribution  can  include  an  entire  model  structure- 
determination/parameter-estimation  scheme  by  setting 

Pkly\  *)  -  p(yfii(x)(x))  (2-7) 

where  for  every  x,  4(x)  is  the  *  minimizing  a  model  structure  determination  criterion.  Thus  for 
each  a,  pa  can  be  regarded  as  a  model  fitting  procedure  including  the  choice  k(x)  of  model  struc¬ 
ture. 


The  negentropy  measure  is  entirely  applicable  to  exact  small  sample  inference,  system  iden¬ 
tification,  and  detection  of  abrupt  changes  that  include  decisions  among  a  multitude  of 
parametric  model  structures  which  may  be  nonnested.  Several  predictive  inference  problems 
have  been  considered  in  the  literature.  Previous  work  has  used  the  negative  entropy  measure  in 
much  more  restricted  formulations  where 


,v 
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the  informative  and  predictive  samples  were  assumed  to  be  indepen¬ 
dent  (Aitchison  and  Dunsmore  (1975),  Ak&ike  (1973))  which  does 
not  include  the  time  series  forecasting  problem 


•  the  use  of  the  negative  entropy  measure  was  considered  as  arbitrary 
(Aitchison  (1975),  Murray  1979))  or  justified  only  asymptotically  for 
large  samples  by  heuristic  arguments  (Akaike  (1973)) 

•  the  negative  entropy  measure  was  justified  as  a  basis  for  comparing 
distinct  parametric  model  structures  (Akaike  (1973)),  but  not  for 
comparing  model  selection  procedures  which  include  choice  of  the 
model  structure  (Larimore,  1983a) 

•  only  the  case  of  nested  structures  such  as  autoregressive  models 
were  justified  (Akaike  (1973))  although  there  has  been  wide  spread 
application  of  it  to  the  general  nonnested  case  such  as  ARMA 
models 

•  the  previous  literature  on  the  use  of  information  theory  in  statistical 
inference  justifies  its  use  by  arguments  of  information  transmission, 
a  set  of  postulates  supposed  to  be  obvious,  or  by  analogy  with 
entropy  in  statistical  mechanics  none  of  which  are  convincing  from 
the  point  of  view  of  statistical  inference  (Kendall  (1973),  Hart 
(1971)). 

Thus  the  results  of  Larimore  (1983a)  give  a  solid  theoretical  justification  for  the  use  of  the  nega¬ 
tive  entropy  measure  in  a  general  setting  which  makes  possible  the  further  general  development 
of  predictive  inference  statistical  methods. 

For  the  parametric  case,  the  very  general  considerations  above  simplify  somewhat.  For  the 
structure  determination  problems,  an  estimator  of  the  form  as  in  (2-7)  associates  a  parameter  esti¬ 
mate  •(*)  with  each  possible  value  a  of  the  sample  space.  To  simplify  the  discussion  in  this  sec¬ 
tion,  we  can  consider  that  the  informative  experiment  x  (fit  set)  and  predictive  experiment  y 
(check  set)  are  independently  and  identically  distributed  K-dimeasional  vectors.  In  Section  4,  the 
general  dependent  time  aeries  analyris  case  will  be  discussed.  We  predict  the  density  of  y  by 
p(y  ,•(*))  where  p(y,t)  is  the  parameterized  clam  of  densities  for  y.  The  negative  entropy  meas¬ 
ure  (2-6)  reduces  in  the  parametric  case  to  that  suggs start  by  Akaike  (1973)  and  is  expressible  as 

«<p.  J)  -  £,*<*./)  -  E,  /  p(y.h.)  dy  (2-8) 

p(y  •CO) 

when  *  denotes  the  true  value  of  ihe  parameter  •  and  where  E,  denotes  expectation  with 
respect  to  the  informative  sample  <  rhai  me  estimator  S(r )  may  involve  different  model  orders 
or  structures  as  in  (2-7)  is  ao  conceptual  difficulty  although  it  may  complicate  the  evaluation  of 
the  aefsntropy  (2-6).  The  predictive  mrnpte  »  t check  set)  is  never  actually  drawn,  but  we  wish  to 
devise  derimon  procedures  which  would  optimally  predict  in  terms  of  (2-6). 
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The  major  problem  is  to  devise  model-estimation/ structure-determination  schemes 

which  come  close  to  minimizing  the  negentropy.  A  major  step  in  that  direction  was  made  by 
Akaike  (1973)  in  proposing  an  extension  of  the  maximum  likelihood  method  to  compare  different 
model  orders  or  structures.  Suppose  that  0*(x)  is  the  maximum  likelihood  estimator  for  a  given 
restriction  of  the  parameters  9  to  a  subspace  Hk  that  is  defined  for  every  x  in  the  sample  space. 
Then  we  wish  to  partition  the  sample  space  into  the  disjoint  subsets  Xk  so  that  for  x  Of ,  the  esti¬ 
mator 

»*•<,)  -  «*(*)  for  x  Of*  (2-9) 

is  used.  Akaike  (1973)  shows  that  asymptotically  for  nested  models,  an  unbiased  estimate  of  the 
negentropy  using  the  maximum  likelihood  model  9*(x)  for  the  whole  sample  space  xOf  is  given 
by  the  Akaike  information  criterion  (AIC)  defined  by 


AIC(k)  =  -2lnp(x,6k(x))  +  2K(k) 


(2-10) 


where  K(k)  is  the  dimension  of  Hk,  i.e.,  the  number  of  parameters  estimated.  The  Minimum 
AIC  Estimate  (MA1CE)  proposed  by  Akaike  (1973,1974b)  is  to  partition  the  sample  space  so  that 
Xk  is  the  set  of  sample  points  for  which 


Then  the  MAICE  estimate  is 


AIC(k)  <  AIC(j)  for  j  =  k 


^maice(x)  ~  ®*(*)  f°r  J^f * 


(2-11) 


(2-12) 


so  that  on  the  set  Xk,  0-A/CK(x)  is  the  maximum  likelihood  estimate  0t(x)  corresponding  to  the 
model  structure  k  with  minimum  AIC. 

For  autoregressive  models,  Shibata  (1981a)  has  studied  the  MAICE  and  other  asymptotically 
equivalent  procedures  for  model-estimation/order-determination.  He  adopted  a  spectral  measure 
of  accuracy  that  is  asymptotically  equivalent  to  the  negentropy.  He  showed  that  asymptotically 
for  large  sample,  MAICE  minimi***  the  negentropy  measure  of  accuracy  (2-8),  which  will  be 
called  entropy  tfficiency.  Hence  MAICE  is  asymptotically  an  optimal  procedure  for  choosing 
autoregressive  models.  Shibata  (1981b)  also  shows  MAICE  as  asymptotically  optimal  for  regres¬ 
sion  problems  which  involve  nonnested  multiple  comparisons.  Other  procedures  for  model  order 
determination  have  been  proposed  (Bhansali  and  Downham,  1977;  Schwarz,  1978)  which 
emphasize  the  choice  of  true  model  order  asymptotically  for  large  samples,  which  is  called  order 
consistency.  In  most  real  problems  the  true  order  is  infinite,  and  even  if  such  a  fiction  were  to 
exist,  a  predictive  criterion  is  much  more  intuitive  in  most  applications.  Shibata  (1983)  has  shown 
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that  order  consistency  aad  entropy  efficiency  are  mutually  exclusive  so  that  a  choice  is  required 
as  to  which  of  these  criteria  is  most  important.  In  particular,  Shaba ta  has  shown  that  an  order 
consistent  procedure  cannot  be  entropy  efficient,  and  that  an  entropy  efficient  procedure  will  not 
be  order  consistent. 

Turing  now  to  a  different  general  problem,  that  of  exact  small  sample  inference,  predictive 
inference  and  entropy  provides  a  new  approach  to  the  problem.  Past  approaches  to  the 
samp:  iference  problem  have  involved  a  number  at  ad  hoc  procedures.  The  entropy  measure 
provides  a  sound  fundamental  measure  of  the  approximation  error  in  predicting  the  density  of  the 
future  experiment.  One  of  the  past  approaches  has  involved  the  estiimuiv*  method  where  the 
predictive  density  is  restricted  to  lie  in  the  class  of  densities  to  contain  the  true.  Recent 

results  have  shown  that  the  use  of  more  general  predictive  densities  can  give  more  optimal  results 
as  measured  in  terms  of  the  entropy  measure  of  model  approximation  error  (Murray,  1979,  1977; 
Aitchison,  1975).  The  optimal  predictive  density  has  been  derived  in  the  class  of  invariant  densi¬ 
ties  minimising  the  entropy  measure.  This  was  derived  before  the  of  the  entropy 

measure  based  upon  the  sufficiency  principle.  As  shown  in  Chapter  6,  this  more  general  and 
optimal  predictive  density  can  be  considerably  better  as  measured  by  the  entropy. 

In  time  aeries  analysis,  the  advantage  of  the  approach  using  predictive  inference  and  entropy 
is  that  it  provides  a  sound  theoretical  framework  in  terms  of  model  approximation  for  the  direct 
comparison  of  very  general  time  series  analysis  models  including: 

a  Consideration  of  many  complex  hypotheses 

a  Comparison  of  nonnested  hypotheses 

a  Comparison  of  dynamic  models  of  different  dynamic  (state)  or  den 

a  Consideration  of  models  fitted  over  different  data  sets  for  detecting 
abrupt  changes 

a  Consideration  of  different  adaptation  rates  for  doing  optimal  model 
tracking 

The  comparison  of  such  diverse  models  is  inherent  in  adaptive  time  series  analysis  and  abrupt 

> 

change  detection,  and  previous  investigations  have  not  had  available  such  a  sound  and  general 
framework  for  solving  these  difficult  problems. 


3.  CONTAINED  NONNESTED  MULTIPLE  COMPARISON  OF  MODELS 


In  this  chapter,  the  general  problem  of  nonnested  multipie  comparison  is  considered.  In 
order  to  develop  a  general  theory,  consideration  is  restricted  to  comparing  models  that  are  the 
result  of  constrained  maaimum  likelihood  estimation.  The  objective  of  the  discussion  is  to  gen* 
eralias  the  currently  available  procedures  for  nonnested  multiple  comparison  in  the  constrained 
wiiimum  likelihood  context. 

The  approach  is  to  view  the  fitting  of  each  alternative  parametric  model  form  as  an  approxi¬ 
mation  procedure,  which  includes  the  notion  that  the  true  model  is  in  general  not  contained  in 
the  class  of  parametric  models  considered  in  the  model  fitting.  This  is  a  departure  from  previous 
approaches  that  involve  primarily  asymptotic  arguments  where  the  parametric  models  approach 
the  true  model  as  the  — "T**  sin  becomes  large.  Such  an  asymptotic  argument  begs  the  question 
of  model  approximation  since  asymptotically  there  is  no  error  in  the  approximation.  It  is  very 
important  in  practice  to  determine  the  extent  to  which  the  asymptotic  approximations  are  accu¬ 
rate  in  moderate  or  small  samples. 

Another  area  of  weakness  in  available  approaches  is  the  assumption  of  nesting  in  comparing 
models  using  entropy  methods.  The  derivations  of  Akaike  involve  the  assumption  of  nested 
models  which  considerably  restricts  the  applications  of  the  methods.  In  practice,  the  AIC  cri¬ 
terion  has  been  applied  in  a  much  wider  context  than  the  comparison  of  nested  models. 

The  results  of  this  chapter  will  provide  the  basis  of  much  more  general  decision  procedures 
for  adaptive  time  series  analysis.  In  these  problems,  the  comparison  of  different  models  based 
upon  different  intervals  of  data  are  compared  to  determine  if  an  abrupt  change  has  occurred. 
Previous  entropy  methods  have  only  compared  different  models  for  a  given  interval  of  data. 

The  first  remit  to  be  discussed  is  the  generalization  of  the  usual  maximum  likelihood  theory 
to  the  constrained  case.  The  regular  case  is  considered  where  the  log  likelihood  function  is 
expandable  in  a  Taylor  series  (Cox  and  Hinkle y,  1974,  p.  281).  These  conditions  permit  the  inter¬ 
change  of  expectation  and  differentiation. 

Following  the  notation  of  Larimore  (1986d,  contained  in  Appendix  A),  let  /(*,•)  denote  the 
log  likelihood  function  of  the  informative  sample  *  considered  as  a  function  of  the  parameters  9. 
Denote  by  S  the  expectation  with  respect  to  the  true  density  p(x  ,0)  with  true  parameter  0.  The 
negative  entropy  measure  is  used  as  the  measure  of  approximation  to  the  true  density  by  an 
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approximating  density  p(*,0*)  with  parameter  value  e*  from  the  subspace  6*  of  parameters.  The 
projection  ft  of  I  onto  the  subspace  04  is  defined  as  the  parameter  value  0*  minimizing 

*,(•.«*)  -  «(* .«)  -  El(x ,0*)  (3-1) 

so  that  the  projection  0*  satisfies  the  condition 

El  (x#)-  0,  (3.2) 

where  '  denotes  the  derivative  with  respect  to  0* .  The  minimum  is  unique  if  and  only  if  the  Hes¬ 
sian  D%  given  by  £>*  -  £/"(*  ,0*)  is  positive  definite.  Thus  for  a  constrained  class  of  models,  the 
projection  of  the  true  parameter  value  defines  the  best  approximation  to  the  true  density  in  the 
class  of  approximating  densities. 

Consider  now  the  constrained  maximum  likelihood  estimate  0*  in  the  subspece  of  parame¬ 
ters  0|  satisfying  the  likelihood  equation 

/•(*.*) -0  (3-3) 

Then  under  the  regularity  conditions,  we  have  for  a  positive  definite  Hessian  D*  and  asymptoti¬ 
cally  for  large  informative  sample  x  that 

•  0*  is  an  unbiased  estimator  of  0* 

•  the  estimation  error  covariance  matrix  is 

£(0*-0*X0l-«l)r  -  (Dfr'E{rT(x#)T(x&))(D!rl  (S4) 

For  the  unconstrained  case,  the  middle  term  is  the  Fisher  information  matrix  and  is  equal  to 
minus  the  Hessian  D%,  but  in  the  general  constrained  case  this  is  not  true. 

Now  consider  the  likelihood  /(yjx,0)  of  the  predictive  experiment  y  conditioned  on  the 
informative  experiment  x.  From  the  above  results,  the  negentropy  can  be  easily  determined. 
Fjcpanding  the  log  likelihood  to  second  order  and  taking  expectation  gives  the  negative  entropy  as 

**,(«.«*)  — +  **,(8.9*)  (3-5) 

which  holds  asymptotically  for  large  informative  sample.  Note  that  the  second  term  is  exact  with 
no  approximation  in  small  samples.  The  first  term  is  an  approximation  involving  the  variation  of 
the  maximum  likelihood  estimate  locally  about  the  projection  0* .  Thus  the  bias  pan  of  the  error 
in  constraining  the  model  is  captured  exactly. 
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la  the  previous  discussion  of  entropy,  the  measure  is  considered  as  a  measure  of  approxima¬ 
tion  error  between  the  true  and  approximating  density.  In  practice,  the  true  density  is  unknown 
and  it  is  desired  to  obtain  an  estimate  of  the  negative  entropy  based  upon  the  observed  informa¬ 
tive  sample.  To  simplify  the  discussion,  the  case  of  x  and  y  independent  is  considered.  An  accu¬ 
rate  estimate  of  the  negative  entropy  was  first  obtained  by  Akaike  using  the  log  likelihood  as  an 
estimate  of  the  entropy  with  a  correction  for  the  bias.  The  Akaike  information  criterion  (A1C) 

A/C(k)  -  -2togp(x ,9*(x))  +  lK(k)  (34) 

was  derived  as  an  unbiased  estimate  of  the  entropy  where  K(k)  is  the  number  of  parameters 
adjusted  in  fitting  the  maximum  likelihood  estimates.  The  second  term  adjusts  for  the  bias  in 
estimating  the  entropy  using  the  informative  sample  and  adjusting  the  parameters  in  fitting. 
Akaike  (1973)  originally  derived  the  A1C  as  an  unbiased  estimate  for  the  relative  comparison  of 
the  prediction  error  in  compering  two  nested  models.  The  newing  is  »if>  important  in  the* 
derivation  because  the  models  are  not  only  nested  but  asymptotically  approach  the  true  model. 

In  the  more  general  case  of  constrained  maximum  likelihood  estimation,  a  difficulty  occurs 
in  the  estimation  of  the  negentropy.  Consider  as  above  the  case  of  x  and  y  independent  and 
identically  distributed.  As  derived  in  Appendix  A,  the  expected  log  likelihood  difference  of  the 
informative  sample  is 

£[/(*,§)  -  /(*,*)]  -  -rr {/  r(x,9*)f(x,®»)}  +  Ry(8&)  (3-7) 

In  the  unconstrained  case,  asymptotically  for  large  informative  sample  the  trace  term  is  equal  to 
the  number  of  parameters  estimated.  Unfortunately  in  the  general  constrained  case,  the  Hessian 
is  not  equal  to  the  Fisher  information  matrix.  The  trace  then  depends  upon  the  expectation  with 
respect  to  the  true  unknown  density  of  the  first  and  second  derivatives  of  the  log  likelihood  func¬ 
tion  at  the  projection  9* .  This  cannot  be  computed  in  general  since  the  true  parameter  is  unk- 


In  cases  where  the  Hessian  and  Fisher  information  are  equal  so  that  the  trace  is  equal  to  the 
number  of  parameters,  then  two  different  parametric  model  structures,  say  9*  and  8*  can  be  com¬ 
pared  using  (3-7)  as 

£{/(x,«*)  -  /(x.&O!  -  *  -  dim(9f)  +  £(9,^)  -  £(•,•*)  (34) 

which  is  equivalent  to  the  A1C.  The  derivation  for  constrained  maximum  likelihood  estimates 
makes  dear  some  of  the  assumptions  in  previous  entropy  methods.  The  major  difficulty  is  caused 
by  the  variation  of  the  Fisher  information  or  Hessian  as  the  parameter  values  change.  In  the 
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derivation  above,  the  issue  of  netting  does  not  trim. 


MNMMi 

In  the  case  of  netted  teats,  the  MAJCE  criterion  reduces  to  the  usual  generalized  likelihood 
ratio  (GLR)  teat.  The  threshold  and  resulting  probability  of  rejecting  the  null  hypothesis,  in.  the 
ate  of  the  test,  depends  upon  the  number  of  additional  parameters  in  the  more  general  model. 

la  comparing  two  hypotheses  H 0  and  fft,  the  AIC  criterion  is  to  choose  according  to  the 
sign  of  the  quantity 

AIC  (Ho)  -A/C  (WO-  -21og^~*  +  2[JT(0)  -  *(l)j  (3-9) 

P(*,r) 

The  AIC  criterion  in  the  nested  case  is  equivalent  to  the  decision  rule 

choose  H0  if  -21ogX  <  2[£(1)  -  JT(O)] 

choose  H ,  if  -2k>gX  2  2|AT(1)  -  JC(0»  (3-10) 


where  the  generalized  likelihood  ratio  K  is  defined  by 

x  ■  ibj'l 

Pi*.  *>) 
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The  threshold  2{iT (1)  -  JT(0)]  is  precisely  twice  the  number  of  additional  parameters  under  the 
hypothesis  W|. 

In  the  case  of  a  normal  class  of  densities,  the  size  a  of  the  test  is  easily  determined  since  the 
GLR  statistic  X  is  chi-squared  on  K(  1)  -  JT(0)  degrees  of  freedom  under  the  null  hypothesis  H0. 
Thus  a  is  given  as  the  solution  of  the  relationship 

XL  -  2"  (3-12) 

were  *  is  the  number  of  additional  parameters  and  X^  is  the  e  probability  point  of  the  cumula¬ 
tive  chi-squared  distribution  on  a  degrees  of  freedom.  Solving  for  a  as  a  function  of  a  gives  the 
size  of  the  test  as  a  function  of  the  number  a  of  additional  parameters  as  shown  in  Table  1.  Note 
that  the  traditional  a  levels  of  0.10,  0.05,  0.01,  0.0QS  and  0001  correspond  respectively  to  about  4, 
8, 16, 20,  and  30  additional  parameters.  It  has  been  known  for  a  long  time  that  in  composite  tests 
or  repeated  applications  of  simple  additions  of  a  single  variable  that  the  significance  level  is  con¬ 
siderably  reduced  such  as  in  step-wise  regression.  The  entropy  approach  makes  explicit  the 
change  in  the  size  a  of  the  test  with  the  number  of  additional  estimated  parameters. 


4.  TIME  SERIES  ANALYSIS  USING  ENTROPY  METHODS 


Methods  of  predictive  inference  and  entropy  offer  a  number  of  advantages  in  the  analysis  of 
rim*  aeries  not  available  in  other  methods.  In  this  chapter  the  basic  time  series  analysis  methods 
are  described,  while  in  the  following  chapter  adaptive  methods  for  time  series  analysis  are 
developed  using  predictive  inference  and  entropy.  First  the  topic  of  the  achievable  accuracy  of 
spectral  analysis  is  addressed  by  relating  the  entropy  measure  directly  to  a  relative  squared  error 
in  estimating  the  power  spectrum.  Following  this  is  a  discussion  of  the  Markovian  representation 
of  time  series  in  terms  of  state  space  models  which  will  be  very  useful  in  representing  time  vary* 
ing  models  of  time  series.  The  canonical  variate  analysis  approach  to  rime  series  is  then  described 
which  forms  the  basis  of  the  adaptive  time  series  analysis  methods  developed  in  the  following 
chapter. 

4J  Achievable  Spectral  Accuracy 

In  this  section,  the  informative  and  predictive  samples  will  be  denoted  by  u  and  v  respec¬ 
tively  to  allow  for  the  traditional  use  of  x  and  y  for  random  processes.  Consider  the  problem  of 
identifying  a  model  for  a  pair  (x,j,)  of  multiple  stationary  time  series  where  x,  and  y,  are  exo¬ 
genous  and  endogenous  time  series  respectively.  Consider  linear  stochastic  models  in  the  form  of 
a  linear  difference  equation 

y,  *  *1  +  *  <7,  +  r,  (4-1) 

H) 

where  h(t  ;•)  is  a  causal  linear  system  giving  the  response  in  y,  to  the  past  exogenous  inputs  x,, 
and  where  q,  is  white  noise.  Suppose  that  the  probability  density  of  the  proceai  is  parameterized 
by  f.  The  exogenous  variable  x,  will  be  considered  as  exactly  observed,  and  the  problem  of 
modeling  y,  conditional  on  xt  is  considered  so  that  prediction  of  y,  from  x,  based  upon  such  a 
model  is  the  principle  problem.  This  also  includes  the  problem  of  no  exogenous  variable  so  that 
only  y,  is  observed. 

We  wish  to  investigate  the  achievable  accuracy  in  estimating  a  model  for  the  process.  In 
particular,  the  entropy  measure  will  be  developed  to  obtain  the  relationship  between  the  number 
of  parameters  estimated  in  the  model  and  the  relative  squared  error  in  estimating  the  power  spec¬ 
trum.  An  example  illustrates  the  effect  on  the  spectral  estimation  error  due  to  the  particular 
class  of  parametric  models  used  in  the  identification  and  the  number  of  parameters  estimated. 
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The  predictive  inference  setup  of  Chapter  2  is  considered  where  the  primary  interest  is  in 
the  asymptotic  behavior  with  large  sample  size  of  both  the  informative  and  predictive 
Consider  an  observed  informative  sample  er“(*foT*  *  •  •  >*£•)£)  of  sis  N  used  to  estimate  the 
process  model,  and  similarly  consider  a  conceptual  predictive  sample  v  of  size  M  used  to  evaluate 
the  accuracy  of  the  estimated  model.  The  predictive  sample  is  assumed  to  be  identically  distri¬ 
buted  but  independent  of  the  informative  sample.  Consider  the  problem  of  inference  on  the 
parametric  class  {p(v  ,•),•&}  of  models  with  probability  densities  p(vjb)  based  upon  the  informa¬ 
tive  sample  a.  Consider  the  conceptual  repeated  — mp»wg  experiment  where  on  each  trial  the 
samples  e  and  v  ate  each  drawn  independently  from  the  process  S  («,•)  with  •  to  be  the 

true  parameter  value.  An  estimative  model  ^-p(v,$(i<))  is  chosen  for  the  density  of  v  by  some 
parameter  estimation  scheme  6(e) .  For  a  stationary  process,  the  negative  entropy  (2-6)  is  linear  in 
the  predictive  sample  size  U ,  so  it  is  more  useful  to  consider  the  per  sample  negentropy.  To  this 
end,  define  the  per  sample  negentropy  denoted  As  derived  in  Appendix  B,  the  I- 

divergence  is  given  asymptotically  by 

nsj)  -  to-i E.  J  lo,- 

p(*  .•(**)) 

-  j  e. 

+  7 1.  /<MV l»(«)  -  /)(«»„ (»X«(»)  -  rf(.)D^  («) 

where  Em  denotes  expectation  relative  to  the  informative  sample  u. 

In  the  multiple  time  series  case,  the  spectral  measure  (4-2)  has  an  intuitive  interpretation  in 
terms  of  principal  components  of  the  power  spectrum  in  the  frequency  domain.  Principle  com¬ 
ponent  representations  of  the  spectral  matrices  5vv(w)  and  Sa(»)  have  the  form 

J(m)  S„(«)y*(<-)  -  D(«)  ,  L(m)  Sa(«)  £*(«)  -  £(«)  (4-3) 

when  J(m)  and  L(m)  given  u  a  function  of  frequency  <o  are  unitary  matrix  transformations  so 
J (•)/*(»•)-/ which  diagonalize  Ja( w)  and  S^O*)  respectively  and  where  £>(«)  and 
£(«*)  are  diagonal  matricies.  Ritering  x(i)  with  transfer  function  L(«)  gives  the  principal  com¬ 
ponent  process  j(i)  which  is  expressed  in  the  frequency  domain  as  X(m)-L(m)X(m)  ,  and  which 
has  the  diagnnal  spectral  matrix  £(<*),  and  similarly  for  q(t). 

The  spectral  measure  (4-2)  is  shown  in  Appendix  B  to  be 


I  ou(“)l :  dm 
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The  first  Mini  on  the  right  hand  side  is  the  integrated  squared  relative  error  of  the  estimated  cos¬ 
pectra  of  the  principal  components,  while  the  second  term  is  the  integrated  squared  coherency  of 
the  ewimated  spectrum  D(m)  which  would  be  zero  if  D(w)-D(w) .  Thus  the  measure  (4-2)  has  a 
dear  interpretation  in  the  multivariate  case  when  the  true  spectrum  D(u)  is  diagonal  but  where 
the  apprarimating  spectrum  D(«s)  is  permitted  arbitrary  coherency  among  components.  The  third 
term  in  the  spectral  measure  (4-4)  is  asymptotically  equivalent  to  replacing  5n(w)  by  SB(w).  This 
term  is  invariant  to  the  unitary  transformations  J(m)  and  L(m)  where  G  («)-/  (MW  (")£.  *  («)  is  the 
transfer  function  H(m)  expressed  in  the  coordinate  frame  of  the  principal  component  aeries  x(r) 
and  y(i).  The  squared  magnitude  error  |  Cy(«)  -  C^(w)| 2  in  the  ij  element  of  the  transfer  func¬ 
tion  is  weighted  by  the  input  signal  to  output  noise  ratio  D  (»•)„/£  («)7;  for  the  pair  (ij). 

The  spectral  measure  of  accuracy  can  be  bounded  in  terms  of  the  number  of  parameters 
estimated.  Suppose  first  for  simplicity  that  the  parametric  class  of  models  contains  the  true  pro¬ 
cess  and  that  k  parameters  are  estimated.  Then  by  Appendix  B  using  the  Cramer-Rao  lower 
bound,  the  per  sample  negentropy  is  bounded  by 


Em  l(S  J)  -  Em  ^-(e  -  if  F(«  -  •)  *  2^  nF  'F  m 


(4-5) 


with  equality  achieved  asymptotically  for  large  informative  sample  N .  This  implies  the  bound  on 
the  achievable  accuracy  in  spectral  estimation  given  by 
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*7*.  /irtf-W* (•)  -  Sw(-)J>2£ 
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In  the  more  general  case  where  the  order  is  infinite  and  the  MA1CE  procedure  for  choosing 
model  order  k  is  used,  then  Shi beu  (1983)  has  shown  the  following  result.  For  each  informative 
sample  size  N  there  is  an  optimal  order  k'(N)  which  minimizes  the  tradeoff  between  variability 
and  bias  in  the  entropy  measure 

*(•.•*)  "  -  7 II  •*-*!!  Jj  +  *(•.•*)  -  +  A(M*) 
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Then  asymptotically  for  larfe  informative  sample  N ,  the  negative  entropy  of  the  MAICE  model 
selection  procedure  is  exactly  that  of  the  model  selection  procedure  using  a  fixed  number  of 
parameters  equal  to  the  optimal  order  k\  Thus  even  in  the  case  of  using  an  entropy  efficient 
model  selection  procedure  where  the  true  model  order  is  infinite,  the  achievable  accuracy  of 
spectral  estimation  (4-6)  can  be  bounded  by  the  function  (4-7)  of  the  sample  size  N  with  k  -4*. 

To  illustrate  the  use  of  the  lower  bound  on  the  achievable  accuracy,  consider  the  ARMA 
(43)  process 

y,  -  13136  y,_!  -  1.4401  y,_2  +  1.0919  y(_3  -  033527  y,^ 

+  w,  +  0.17921  w,_,  +  032020  w,_2  +  026764  w,_3  (43) 

with  the  ndise  variance  of  w  as  Q  -  1.7258  xlO-2.  This  process  was  analyzed  by  Gerach  and  Sharp 
(1973)  and  Akaike  (1974b)  to  show  the  increased  accuracy  of  ARMA  models  over  AR  models. 
For  a  sample  size  of  800,  the  optimal  order  was  found  to  correspond  to  k’  »  18  for  fitting  an  AR 
model  to  the  data.  Akaike  (1974b)  fitted  several  models  to  data  using  AR,  ARMA, 

and  Hanning  window  methods.  Figure  1  shows  the  variability  term  of  the  spectral  error  as  a 
function  of  frequency  for  the  various  model  fitting  procedures.  Since  the  optimal  order  was  used 
for  the  AR  model,  the  bias  is  also  included.  The  ARMA  model  has  no  bias  sine-,  the  full  order  is 
chosen  with  high  probability.  On  the  ocher  hand,  the  Hanning  window  has  significant  bias  since  a 
fixed  bandwidth  is  used  to  spectrally  smooch  the  data  at  all  frequencies,  and  this  bandwidth  is  not 
sufficient  to  estimate  the  sharp  peak  without  bias.  Use  of  a  wider  bandwidth  would  increase  the 
already  large  error  at  all  the  ocher  frequencies.  The  AR  and  ARMA  methods  are  clearly  adap¬ 
tive  in  that  the  methods  have  a  greater  bandwidth  to  accommodate  the  rapid  changes  (large 
second  derivative  or  curvature)  near  the  peaks  and  troughs  but  lower  bandwidth  in  regions  with 
law  curvature.  The  greater  parametric  efficiency  of  the  ARMA  method  is  clearly  depicted  in 
these  regions  of  low  curvature.  Repeated  simulation  of  the  time  series  data  from  the  ARMA(43) 
model  and  the  maximum  likelihood  estimation  of  ARMA  models  with  MAICE  confirms  that  the 
lower  bound  indeed  gives  an  accurate  description  of  the  spectral  estimation  error  in  practice  (Lar- 
imore,  Mahmood,  and  Mehra,  1984).  Independent  methods  have  been  developed  for  obtaining 
simultaneous  confidence  bends  on  spectral  estimates  (Larimore,  1986c).  Such  confidence  bands 
are  proportional  to  the  integrand  of  the  spectral  measure  of  accuracy  so  that  the  integrand  gives 
an  accurate  measure  of  spectral  error  at  each  frequency  as  well. 


FREQUENCY  (HZ) 


Figure  1.  Spectral  Accuracy  of  Different  Model  Fitting  Procedures  (lower  curves)  in  Approxi¬ 
mating  the  True  Spectral  Density  (upper  curve). 
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4J  Markovian  Models  of  Time  Scries 

In  this  section  Markovian  or  state  space  models  of  time  series  are  reviewed.  Such  models 
have  not  been  widely  used  in  time  series  analysis,  although  there  is  wide  spread  use  of  such 
models  in  filtering  and  prediction  with  numerous  applications  in  engineering.  State  space  models 
have  a  number  of  advantages  in  time  series  analysis  that  are  attractive  for  automatic  implementa¬ 
tion  on  microprocessors  using  the  canonical  variate  analysis  method  discussed  in  the  next  section. 
Such  procedures  allow  the  automatic  selection  of  model  state  order  using  entropy  methods  and 
lend  themselves  to  adaptive  methods  for  time  varying  processes  discussed  in  the  next  Chapter. 

The  starting  point  of  any  approach  is  the  joint  probability  distribution  of  the  past  and  future 
observations ,•)  where  p,  are  the  past  inputs  u,  and  outputs  y,  up  to  time  t  and/,  are  the 
outputs  y,  in  the  future  at  time  r  defined  by 
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pj  ■  ( •  •  •  .  fJ  =  (yZ+i  «y/+2 »  • )  (4-9) 

and  8  is  a  vector  of  parameters  indexing  the  model.  A  fundamental  property  of  a  Markov  process 
of  finite  state  order  is  the  existence  of  a  finite  dimensional  state  x,  which  is  a  linear  function 

*,  -  Cp,  (4-10) 

of  the  past  p,.  The  state  x,  has  the  property  that  the  distribution  of  the  future  /,  conditioned  on 
the  post  p,  is  identical  to  that  of  the  future  f,  conditioned  on  the  finite  dimensional  state  x,  so 

p(ft\Pt$)  (4-11) 

Thus,  only  a  finite  amount  of  information  from  the  past  is  relevant  to  the  future  evolution  of  the 
process. 

A  stationary  Markov  process  of  some  particular  state  order  can  be  represented  by  a  vector 
difference  equation  with  the  general  form  (Lindquist  and  Pavon,  1981) 

x,+1  =  <fcr,  +Gu,  +w,  (4-12) 

y,  -  Hx,  +  Au,  +  Bw,  +  v,  (4-13) 

where  u  is  an  input  vector  process,  y  is  the  output  vector,  x  is  the  state  vector,  and  w  and  v  are 
white  noise  processes  that  are  independent  with  covariance  matrices  Q  and  R  respectively.  The 
matrices  <b.  A,  B ,  G ,  and  H  determine  the  dynamics  of  the  process  and  correlational  characteris¬ 
tics  of  the  disturbances.  The  various  matrices  are  considered  as  functions  of  the  parameters  8 
specifying  the  process.  The  white  noise  processes  model  the  covariance  structure  of  the  error  in 
predicting  y  from  u . 

For  time  series  analysis  and  system  identification,  the  parameterization  of  the  model  is  an 
important  issue.  The  elements  of  all  of  the  matrices  of  the  state  space  model  (4-12)  and  (4-13) 
and  noise  covariances  are  not  independent  parameters  of  the  model.  In  fact  for  each  distinct  pro¬ 
bability  distribution  there  is  an  equivalence  class  of  models  of  the  form  (4-12)  and  (4-13)  with  the 
same  distribution.  It  can  be  shown  (Candy,  Bullock,  and  Warren,  1979)  that  the  number  of 
independent  parameters  is 

K (4 )=2 kn  +n(»*  +1)  2+km  +nm  (4-14) 

where  k,  n,  and  m  are  the  vector  dimensions  of  the  state  x, ,  outputs  y, ,  and  inputs  u,  respectively. 
If  there  is  no  instantaneous  feedforward  so  A  =0,  then  the  term  nm  is  deleted,  while  if  there  is  no 
input  so  A  Mj  =-0  the  terms  km+nm  are  deleted. 
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The  state  yw  parameterization  (4-12)  and  (4-13)  is  not  unique,  however  in  the  next  section 
a  well  conditioned  procedure  for  selecting  a  unique  model  for  the  equivalence  class  will  be 
described.  For  an  individual  state  space  model  there  exists  a  corresponding  ARMA  model  and 
visa  vena.  However  the  two  classes  are  not  equivalent  as  classes.  In  general  there  is  no  ARMA 
class  of  models  equivalent  of  a  particular  state  space  class.  The  ARMA  class  has  one  major  diffi¬ 
culty  •  there  is  no  global  parameterization  of  the  state  space  models  of  a  given  order.  The  diffi¬ 
culty  is  in  the  ARMA  representation  which  becomes  singular  at  certain  models  such  as  one 
involving  the  cancellation  of  a  pole  and  a  zero.  This  causes  great  difficulty  in  numerical  methods 
in  attempting  to  automatically  identify  higher  order  models  which  may  involve  such  cancellations 
of  poles  and  zeros. 

The  major  advantage  of  the  state  space  models  is  the  availability  of  efficient  and  numeri¬ 
cally  well  conditioned  procedures  for  model  identification  discussed  in  the  next  section,  and  the 
explicit  Markov  structure  allows  for  the  the  development  of  direct  adaptation  procedures 
developed  in  the  next  chapter. 

43  Canonical  Variate  Analysis  of  Time  Series 

The  canonical  variate  analysis  method  for  identification  of  state  space  time  series  models  is 
described  in  this  section.  The  methods  for  the  determination  of  the  state  order  and  selection  of 
the  state  using  concepts  of  canonical  variate  analysis  are  first  discussed.  The  determination  of  the 
state  space  model  is  then  computed  by  simple  regression.  The  computation  involves  primarily  a 
singular  value  decomposition  of  the  sample  covariance  matrix  of  the  process. 

A  generalization  of  the  canonical  variate  analysis  method  has  recently  provided  a  completely 
general  solution  to  the  static  reduced  rank  stochastic  prediction  problem  which  is  well  defined 
statistically  and  computationally  even  when  some  or  all  of  the  various  covariance  matrices  are 
singular  (Larimore,  1986b).  All  other  previous  methods  in  the  statistical  literature  do  not  address 
the  general  problem.  This  result  is  the  foundation  of  the  time  series  analysis  methods  using 
predictive  inference  and  entropy  including  the  adaptive  time  series  methods. 

The  original  development  of  the  canonical  correlation  analysis  method  of  mathematical 
statistics  was  by  Hotelling  (1936;  see  also  Anderson,  1958).  The  application  of  canonical  variate 
analysis  to  stochastic  realization  theory  and  system  identification  was  done  in  the  pioneering  work 
of  Akaike  (1974a,  1975,  1976).  This  initial  work  has  a  number  of  limitations  such  as  no  system 
inputs,  no  additive  measurement  noise,  substantial  computational  burden  involving  numerous 
SVD's,  a  heuristic  set  of  decisions  for  choosing  a  basis  for  representation  of  the  system,  and  a 
number  of  approximations  including  computation  of  the  AIC  criterion  for  decision  on  model 
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Some  important  generalizations  and  improvements  in  Akaike's  canonical  correlation  method 
have  recently  been  made  by  Larimore  (1983b).  These  include  generalization  to  systems  with  addi¬ 
tive  measurement  noise  and  with  inputs  including  feedback  controls.  A  major  departure  of  the 
approach  from  previous  work  is  the  use  of  a  single  canonical  variate  analysis  to  optimally  choose 
k  linear  combinations  of  the  past  for  prediction  of  the  future.  The  very  natural  measure  of  qua- 
dratically  weighted  prediction  errors  at  possibly  all  future  time  steps  is  used.  Formulated  as  such 
a  prediction  problem,  it  is  shown  how  a  generalized  canonical  variate  analysis  gives  the  solution 
explicitly.  The  interpretation  of  canonical  variates  as  optimal  predictors  is  central  in  motivating 
interest  in  such  a  problem  formulation  and  is  scarcely  found  in  the  statistical  literature  (Larimore, 
1986b).  The  optimal  * -order  predictors  are  not  in  general  recursively  computable,  but  the 
optimal  state-space  structure  for  approximating  them  is  expressed  simply  in  terms  of  the  canonical 
variate  analysis.  The  problem  of  finding  an  optimal  Hankel  norm  reduced  order  model  (Adam- 
jan  et  al,  1971;  Kung  and  Lin,  1981)  is  related  to  the  canonical  variate  approach  (Camuto  and 
Menga,  1982;  Larimore,  1983b).  The  balanced  realization  method  is  a  particular  case  of  the  gen¬ 
eralized  canonical  variate  analysis  (Desai  and  Pal,  1984). 

To  more  concisely  discuss  the  canonical  variate  method,  the  results  in  Larimore  (1983b, 
1986b)  are  briefly  reviewed.  Consider  the  problem  of  choosing  an  optimal  system  or  model  of 
specified  order  for  use  in  predicting  the  future  evolution  of  the  process.  As  in  Section  42,  con¬ 
sider  the  past  p,  of  the  inputs  u,  and  outputs  y,  before  time  t  and  the  future  of  the  outputs  y,  at 
time  t  or  later  so 

pJ  *  (  '  ‘  •  .  fJ  ■  (ft+irf+s*  ' )  (4-15) 

We  assume  that  the  procesres  u,  and  y,  are  jointly  stationary  and  denote  the  covariance  matrices 
among/,  and  p,  “2//»  V'  and  V 

The  major  interest  is  in  determining  a  specified  number  k  of  linear  combinations  of  the  past 
p,  which  allow  optimal  prediction  of  the  future  /,.  The  set  of  *  linear  combinations  of  the  past 
P,  are  denoted  as  a  kxl  vector  m,  and  are  considered  as  1 -order  memory  of  the  past.  The 
optimal  linear  prediction  /,  of  the  future  /, ,  which  is  a  function  of  a  reduced  order  memory  m, , 
is  measured  in  terms  of  the  prediction  error 

fill/,  -/Ml  I*)  -  E{(f, -fJVffif,  -/,)}  (4-16) 

where  £  is  the  expectation  operation  and  t  denotes  the  pseudoinverse  of  a  matrix.  The  optimal 
prediction  problem  is  to  determine  an  optimal  k  -order  memory 
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*»  “  hPt  (4-17) 

by  chooring  the  4  rows  of  Jk  such  that  the  optimal  linear  predictor  /,(«,)  baaed  on  m,  minimis 
the  prediction  error. 

As  derived  in  Larimore  (1986b),  the  solution  to  this  problem  in  the  completely  general  case 
where  the  matrices  s//»  and  Iff  may  be  singular  is  given  by  the  generalized  singular  value 
decomposition  as  stated  in  the  following  theorem. 

Tleeseai  1.  Consider  the  problem  of  choosing  k  linear  combinations  m,  -  Jkp,  of  p,  for 
predicting  /,  such  that  (4*16)  is  minimized  where  2^  and  Iff  are  poaribty  singular  positive  sem- 
idefinite  symmetric  matrices  with  ranks  m  and  X  respectively.  Then  the  existence  and  uniqueness 
of  solutions  are  completely  characterized  by  the  (2^  Xff  )-generalized  singular  value  decomposi¬ 
tion  which  guarantees  the  existence  of  matrices  J,  L,  and  generalized  singular  values  yj,  •  ■  •  ,yr 
such  that 

mU  .  LlffLT  ,  Jl^fL7  •  Dtog(yl*...*y„Ol . 0)  (4-18) 

The  solution  is  given  by  choosing  the  rows  of  Jk  as  the  first  k  rows  of  J  if  the  4-th  singular  value 
satisfies  yk  >  ykH.  If  there  are  r  repeated  singular  values  equal  to  yt,  then  there  is  an  arbitrary 
selection  from  among  the  corresponding  singular  vectors,  ie.  rows  of  J .  The  minimum  value  is 

min  ||  II  -  (1  -  7?)  +  *  •  *  +  (1  -  it)  (4-19) 

rmkVklmJl)  -  k  *// 

This  result  not  only  gives  a  complete  characterization  of  the  solutions  in  selecting  optimal 
predictors  <14  from  the  past  p,  for  prediction  of  the  future  but  the  reduction  in  prediction 
error  for  all  pomibte  selections  of  order  k  is  given  simply  in  terms  of  the  generalized  singular 
values.  This  is  of  great  importance  since  it  avoids  having  to  do  a  considerable  amount  of  compu¬ 
tation  to  determine  what  selection  of  order  is  appropriate  in  a  given  problem. 

The  generalized  CVA  method  allows  the  determination  of  the  fit  of  the  various  state  space 
models  and  the  selection  of  the  best  model  state  order  before  computation  of  the  state  space 
models.  Consider  the  general  case  of  identifying  a  state  space  model:  given  the  past  of  the 
related  random  process  «,  and  y, ,  we  wish  to  model  and  predict  the  future  of  y,  by  a  4-order 
state-space  structure  of  the  form  (4-12)  and  (4-13) 

x, -  <hx,  -K/u,  +w,  (4-20) 


y,  -  Hx^AUt+BWf+v, 


(4-21) 


Ia  the  computational  problem  given  finite  data,  the  past  and  future  of  the  process  are  taken  to  be 
finite  at  length  d  lags  so 

pj  ■  (y,-i.  •  •  •  Jferfi.  •  •  •  .  /»r  *  (y,r.  •  •  •  O',*)  (4-22) 

Akaike  (1976)  proposed  choosing  the  number  d  of  lags  by  learn  squares  autoregresrive  modeling 
using  recumve  least  squares  algorithms  and  choosing  the  number  of  lags  as  that  minimising  the 
AIC  criterion  discussed  below.  This  insures  that  a  sufficient  number  of  lags  are  used  to  capture 
all  of  the  statistically  significant  behavior  in  the  data.  This  procedure  is  easily  generalised  to 
include  the  case  with  inputs  a,.  The  generalized  SVD  of  Theorem  1  determines  a  transformation 
/  of  the  past  that  puts  the  state  in  a  canonical  form  so  that  the  memory  m,  -  Jp,  contains  the 
states  ordered  in  terms  of  their  importance  in  modeling  the  process.  The  optimal  memory  for  a 
given  order  k  then  corresponds  to  selection  of  the  first  k  states. 

In  order  to  decide  on  the  model  order  to  select,  the  Akaike  information  criterion  (2-10)  is 
computed  where  the  number  of  parameters  is  determined  from  (4-16).  Once  the  optimal  k-order 
memory  m,  is  determined,  state-space  equations  of  the  form  (4-12)  and  (4-13)  for  approximating 
the  process  evolution  are  easily  computed  by  a  simple  multiple  regression  procedure  (Larimore, 
1963b). 

Since  the  CVA  system  identification  procedure  involves  the  state  space  model  form,  it  has 
the  major  advantage  that  the  model  is  globally  identifiable  so  that  the  method  is  statistically  well 
conditioned  in  contrast  to  ARMA  modeling  methods  (Gevers  and  Wertz,  1982).  Furthermore, 
since  the  computations  are  primarily  a  SVD,  they  are  numerically  stable  and  accurate  with  an 
upper  bound  on  the  required  computations  (Golub,  1969).  Thus  the  method  is  completely  reliable. 
It  has  been  demonstrated  as  such  in  the  time  series  analysis  software  Forecast  Master  that  is  com¬ 
mercially  sold  by  SSI.  From  the  theory  of  the  CVA  method  (Larimore,  Mahmood  and  Mehra, 
1964),  it  can  be  shown  that  there  are  no  difficulties  such  as  biased  estimates  caused  by  the  pres¬ 
ence  of  a  correlated  feedback  signal.  The  CVA  method  was  demonstrated  in  real  time  identifica¬ 
tion  and  adaptive  control  of  unstable  aeroelastic  wing  flutter  on  a  male  model  F-16  aircraft  in  the 
NASA  Langley  Transonic  Dynamics  Wind  Tunnel  in  February  1986. 


5.  ADAPTIVE  TIME  SERIES  ANALYSIS 


The  state  space  model  identification  methods  are  developed  in  this  chapter  for  adaptive  time 
series  analysis.  The  concepts  of  a  changing  Markov  process  are  first  discussed  along  with  con¬ 
cepts  of  a  piecewise  constant  model  of  the  process  that  is  constant  over  intervals  of  time.  The 
approach  to  adaptation  to  dow  changes  using  predictive  inference  and  entropy  is  described.  This 
leads  to  a  model  fitting  criterion  for  choosing  an  optimal  data  interval  that  balances  the  decreas¬ 
ing  sampling  variability  with  increasing  sample  aim  against  the  increasing  misamodeling  error  due 
to  use  of  a  constant  model  over  an  interval  of  data.  In  fitting  models  involving  abrupt  changes, 
the  models  fitted  over  various  intervals  are  compared  to  determine  if  an  abrupt  change  has 
occurred.  This  involves  the  comparison  of  models  determined  from  dam  on  different  date  inter¬ 
vals  in  predicting  the  error  on  a  different  interval.  Several  examples  are  given  in  using  the  pro¬ 
cedure  an  changes  involving  the  dynamics,  noise  excitation,  measurement  noise,  and  ocher 
changes. 

Concepts  of  adaptive  systems  have  been  around  since  the  1950'i  involving  various  senses  of 
adaptation.  The  present  literature  on  the  subject  includes  a  number  of  methods  such  as  recursive 
computational  schemes,  exponential  forgetting,  lattice  computational  methods,  etc.,  which  have 
certain  Ttnobs*  that  allow  tuning  of  the  algorithm  to  accommodate  changes  in  the  characteristics 
of  the  actual  processes.  Reviews  of  these  and  related  methods  are  contained  in  several  recent 
special  issues  of  technical  journals  and  books  (Special  Issue  on  Adaptive  Control,  Automatica, 
Voi.  20,  No.  S,  1985;  Special  Issue  oa  Linear  Adaptive  Filtering,  IEEE  Trans,  on  Information 
Theory,  Vot.  30,  No.  2,  1964;  Honig  and  Messerschmitt,  1984).  While  these  methods  do  permit 
some  degree  of  adaptation  to  process  changes,  the  methods  of  adaptation  are  ad  hoc,  and  no 
sound  underlying  statistical  principle  for  adaptation  is  proposed  or  demonstrated.  As  might  be 
expected,  these  methods  can  work  poorty  on  certain  cases  because  of  the  lack  of  a  sound  statisti¬ 
cal  basis. 

In  particular,  the  recursive  prediction  error  and  lattice  methods  are  convenient  due  to  their 
recursive  form  and  provide  an  estimate  at  every  observation  (Friedlander,  1982a,  1982b,  1983; 
Ljung  and  Soderstrom,  1983).  Also,  the  recursive  algorithms  can  be  used  for  adaptation  by 
exponential  weighting  of  the  past  data  (Wellstead  and  Sanoff,  1981;  Irving,  1979;  Evans  and  Betz, 
1982).  But  the  rational  for  exponential  weighting  has  not  been  given  a  sound  fundamental  justifi¬ 
cation,  but  is  used  largely  due  to  its  ease  of  use.  The  choice  of  the  exponential  weight  has  been 
ad  hoc  and  susceptible  to  misinterpretation  of  changing  noise  variance  levels  as  time  varying 
changes  in  the  dynamics  (Hagglund,  1983). 
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Adaptation  to  abrupt  changes  has  been  largely  discussed  in  the  fault  detection  literature.  A 
comprehensive  survey  of  fault  detection  methods  is  given  by  Willsky  (1976).  See  also  Mehra  and 
Peachon  (1971),  Willksy  and  Jones  (1974),  Willsky  (1980),  and  Isermann  (1984).  These  methods 
have  a  number  of  short  «*w«"ing»  in  detecting  changes  in  dynamics,  computational  illconditioa- 
ing,  and  excessive  computational  burden. 

The  central  computation  of  any  adaptive  algorithm  involves  the  extension  of  methods  for 
identification  of  stationary  time  series.  There  are  several  difficulties  with  currently  available 
methods  and  software  for  the  identification  of  system  dynamics  and  noise  characteristics.  Current 
methods  include  the  self  tuning  regulator  (STR)  (Ljung,  1983;  Astrom,  1973;  Astrorn  et  al,  1973, 
1977),  maximum  likelihood  estimation  (MLE)  (Mehra  and  Tyler,  1973;  Larimore,  1981a),  the 
Box-Jenkins  (BJ)  method  (Bon  and  Jenkins,  1976),  and  a  variety  of  heuristic  approaches.  The 
current  state  of  the  art  in  both  MLE  and  BJ  require  that  an  analyst  be  involved  in  the  procedure, 
and  the  required  number  of  computational  iterations  is  not  bounded.  The  STR  has  been  applied 
successfully  to  simple  processes,  but  is  not  completely  reliable  for  general  processes  particularly 
when  multi-input,  multi-output  systems  are  involved.  In  addition,  the  recursive  prediction  error 
algorithm  used  in  the  STR  requires  a  good  initial  estimate  and  so  is  not  suitable  for  short  date 
where  no  apriori  date  is  available.  The  heuristic  approaches  tend  to  be  for  special  purposes  and 
are  rather  unreliable  in  general  applications. 

The  problem  of  modeling  changing  processes  involves  primarily  two  types  of  changes 

•  changes  that  are  slow  compared  to  the  date  interval  used  for  identif¬ 
ication 

•  abrupt  changes  occurring  infrequently  compared  to  the  date  interval 
used  for  identification. 

If  the  changing  process  changes  too  rapidly  or  the  abrupt  changes  occur  too  frequently  relative  to 
the  date  interval  required  for  sufficiently  accurate  identification,  then  it  is  not  possible  to 
separate  die  actual  system  changes  from  the  variability  due  to  sampling. 

Consider  a  time  varying  Markov  process  where  the  conditional  probability  of  the  future 
given  the  past  depends  upon  time  i  so 

(5-1) 

where  the  state,  defined  as  linear  combinations  of  the  past  p, ,  varies  with  time  as 


x,  -  C,p, 


(5-2) 


la  enter  to  upwa  tha  *»«—  evolution  of  the  sute  of  a  Markov  process  in  terms  of  a  system  of 
state  space  equations  of  the  form 

*,+i  -  ♦,*,  +  G,u,  +  w,  (5-3) 

y,  -  H,x,  +  A,*,  +  B,w,  +  v,  (54) 

where  the  various  matrices  are  time  varying,  it  is  necessary  and  sufficient  that  the  conditional  dis¬ 
tribution  of  the  future  /,  conditioned  on  the  state  x,  have  the  form 

(5-5) 

This  condition  is  essentially  that  the  information  in  the  state  x,  for  predicting  the  future  f,  is  con¬ 
tained  in  the  state  xf_1  delayed  by  one  time  step  and  the  present  inputs  a,  and  outputs  y,.  This 
condition  is  satisfied  by  most  physical  systems  since  the  memory  is  stored  as  energy  in  physically 
describabie  states.  If  the  system  changes  abruptly,  there  may  also  be  an  abrupt  change  in  the 
input  or  a  large  noise  innovation  v,  associated  with  a  significant  change  in  the  state  of  the  system. 
Formulated  in  this  way,  it  is  apparent  how  the  sute  space  modeling  methods  are  particularly  use¬ 
ful. 

In  any  modeling  method  based  upon  a  finite  sample  of  dau,  only  a  finite  number  of  param¬ 
eters  can  be  determined  which  are  much  fewer  in  number  than  the  number  of  dau.  Of  the  vari¬ 
ous  ponabie  methods  for  modeling,  the  simplest  and  least  presumptive  is  the  piecewise  constant 
model  which  is  constant  over  various  intervals  of  dau.  Thus  consider  the  model  of  the  form  (5-3) 
and  (54)  with  piecewise  constant  parameters  0, 

*»+i  “  %*•  +  G,*,  +  w,  (5-6) 

>»  "  Htx,  +  A,u,  +  B, w,  +  v,  (5-7) 

The  coefficient  matrices  G„  Ht,  A,,  and  B,  are  functions  of  the  parameters  0,  which  are  con¬ 
stant  over  an  interval  of  time  T,  and  change  from  one  time  interval  to  another. 

In  the  following  sections,  adaptive  time  series  analysis  methods  are  developed  by  considering 
various  hypotheses  concerning  slow  and  abrupt  changes.  The  predictive  inference  and  entropy 
methods  provide  a  means  of  objectively  comparing  the  vast  multitude  of  such  hypotheses  entailed 
in  the  adaptive  time  series  analysis  problem. 


TIm  problem  of  to  stow  variations  it  primarily  that  of  determining  the  length  of 

wn  ioaagval  to  um  in  the  dme  varying  model  (54)  and  (5-7).  Contider  the  di  virion  of  a  taction 
of  data  into  2*  tubintervalt  of  length  2*  templet  were  h  and  l  are  an  integers.  Then  the  variout 
hypotheses  can  be  **■—**■—«<  such  at  H, :  divide  the  interval  into  tubintervalt  of  length  2 1 .  For 
each  tubinterval  lj,  for  /'«  1,2,...,2*,  suppose  a  state  space  model  denoted  it  fitted  tiring  the 
CVA  method  with  A1C  used  to  select  the  best  mode!  state  order. 

By  tuccetrive  application  of  the  Markov  property,  the  joint  probability  density  of  the  obser¬ 
vations  conditioned  on  the  initial  state  is  given  by 

lofp(r, . Y* I  Y0,9k)  -  XitvOOl  (54) 

where  *  (if . •£)  is  the  parameter  vector  for  the  composite  model  consisting  of  all  of  the 

models  over  the  2*  subintervals.  This  gives  the  tog  likelihood  as  the  sum  of  the  conditional  log 
likelihoods  on  each  subinterval. 

Consider  the  entropy  measure  of  the  composite  model  B* .  Using  the  asymptotic  approxima¬ 
tion  (5-5),  the  negentropy  is 

*(&,&)  -  ^  +  *($,$*)  -  It  +  *(•.•')]  («) 

where  B*  and  are  understood  to  denote  the  parameters  constrained  under  the  respective 
hypotheses  involving  estimation  of  these  models.  Note  that  there  is  a  tradeoff  between  the  first 
term  which  increases  with  finer  subdivisions  of  the  data  and  the  second  term  which  decreases  as 
the  number  of  parameters  is  increased  with  finer  subdivisions  of  the  data.  A  minimum  of  the 
negentropy  defines  the  optimum  subdivirion  of  the  data. 

To  the  negentropy  from  the  sample,  the  A1C  is  used.  From  the  definition  of  AJC, 

the  A1C  coneeponding  to  (5-9)  is  given  by 

AIC(Y} . K^.d*)  -  XAIC(Yj,V)  (5-10) 

The  optimal  data  length  is  chosen  as  the  *  minimising  the  above  AJC. 

As  an  illustration  of  this  scheme,  it  was  applied  to  a  problem  in  the  1978  Workshop  on  Spec¬ 
tral  Estimation  (Gerhardt,  1978)  in  estimating  the  instantaneous  frequency  of  a  sine  wave  with 
time  varying  frequency  in  the  presence  of  interference  and  random  noise.  The  data  were 
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SAMPLE 

NUMBER  OF  POINTS  IN  DATA  SUBINTERVAL 

TIMES 

16 

32 

64 

128 

0 

22 

38 

KA 

12.40 

■■ 

12.12 

12.15 

1239 

■1 

70 

86 

102 

118 

m 

11.66 

12.12 

MB 

1237 

12.64 

1252 

1253 

(1228)  1205 

(12.17)  1227 

(12.16)  1236 

(1236)  1236 

190 

166 

182 

198 

214 

230 

246 

1201 

1204 

1226 

1266 

1152 

1203 

1229 

1106 

11.98 

12.02 

11.47 

(1205)  1206 

(12.03)  11.79 

(12.17)  12.06 

(12.40)  12.40 

278 

1203 

1225 

1102 

llfl 

1139 

XJA 

1108 

11.43 

11.76 

342 

398 

171 

1208 

■■ 

1155 

12.08 

390 

(11.74)  1066 

(11.70)  1165 

(11.78)  11.79 

(11.75)  11.75 

ALL  DATA 

1202 

11.67 

1204 

12.17 
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The  **<««««  of  the  instantaneous  frequency  wu  chosen  as  the  maximum  of  the  spectrum 
obtained  from  the  CVA  model  fitted  to  the  data.  The  estimated  instantaneous  frequency  is  given 
in  Table  3  along  with  the  ocher  three  best  solutions  obtained  by  the  other  participants  in  the 
workshop. 


SAMPLE 

TIME 

TRUE 

WILEY  AND 
CARMICHAEL 

WEINER 
ET  AL 

ADAPTIVE 

CVA 

MAXIMUM 

ENTROPY 

32 

141.47 

141.47 

14139 

142.13 

141.7 

64 

145.73 

145.73 

14538 

145.13 

1462 

96 

15237 

15237 

152.69 

15222 

152.6 

128 

160.73 

160.73 

160.71 

16135 

1603 

160 

17030 

170.00 

17022 

16930 

169.9 

192 

17927 

17927 

179.48 

178.40 

179.9 

224 

187.63 

187.63 

187.72 

18836 

1882 

256 

19427 

19427 

19433 

193.90 

1943 

288 

198.53 

19833 

19733 

19722 

1983 

320 

20030 

200.00 

199.72 

200.94 

2013 

352 

198.53 

19833 

19739 

19835 

1983 

384 

19427 

19427 

193.93 

194.66 

1943 

416 

187.63 

187.63 

18739 

187.99 

187.4 

448 

17927 

17927 

17837 

17932 

178.9 

480 

17030 

170.00 

169.64 

17021 

1703 

512 

160.73 

160.73 

160.11 

164.0 

Table  3.  Instantaneous  Frequency  Estimates. 

The  best  solution  by  Carmichael  and  Wiley  (1978)  uses  a  special  zero  crossing  method  that  is 
applicable  only  to  pure  sine  waves  so  that  it  will  not  generalize  to  more  general  spectra. 

The  adaptive  CVA  method  did  about  as  well  as  the  best  of  the  methods  other  than  Wiley 
and  Carmichael,  and  much  better  than  a  lot  of  them.  Note  that  the  adaptive  CVA  approach 
makes  no  assumptions  about  the  form  of  the  spectrum  or  the  character  of  the  time  variation. 
Also  the  adaptive  CVA  method  is  completely  automatic,  and  in  this  example  did  not  involve  any 
considerations  by  an  analysis  to  determine  the  choke  or  modification  of  the  computations.  As 
measured  in  terms  of  the  estimated  instantaneous  frequency,  the  method  did  very  well. 


The  primary  problem  in  adaptive  time  series  analysis  is  determining  if  and  when  an  abrupt 
change  has  occurred.  This  problem  reduces  to  comparing  two  intervals  of  data  and  determining 
if  the  same  process  model  is  a  better  description  of  the  observations  than  a  different  model  for 
each  interval  of  data.  A  complication  of  the  problem  is  that  the  exact  time  is  unknown  and  must  ! 

be  determined  from  the  data.  This  invoivei  the  comparison  of  a  multitude  of  hypotheses  con¬ 
cerning  the  posrible  time  of  occurrence  of  the  abrupt  change.  In  addition,  the  ability  to  detect 
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the  abrupt  chance  depends  upon  the  data  length  of  the  data  intervals  used.  The  best  data  length 
for  detection  depends  upon  the  type  of  abrupt  change  since  some  changes  affect  the  observations 
immediately  and  the  effect  decreases  rapidly,  while  for  other  changes  the  effect  on  the  observa¬ 
tions  takes  some  time  before  it  is  apparent.  Thus  the  consideration  of  different  data  lengths 
requires  additional  hypotheses  to  be  considered  and  compared.  The  predictive  inference  and 
entropy  methods  give  a  sound  basis  for  the  comparison  of  the  multitude  of  nonnested  hypotheses. 

Consider  the  problem  of  determining  if  there  is  a  change  in  the  the  process  between  two  dis¬ 
joint  data  subintervals.  The  detection  problem  considered  is  where  the  process  is  modeled  as  a 
slowly  changing  process  using  some  efficient  procedure  such  as  given  in  the  previous  section.  The 
notation  of  the  previous  sections  will  be  used  with  subscript  1  or  2  corresponding  to  the  data 
intervals  Y  x  from  the  past  slowly  changing  models  and  Y2  from  the  new  data  that  is  to  be  com¬ 
pared  for  detection  of  an  abrupt  change.  The  subscripted  parameters  8X  and  62  with  or  without 
other  superscripts,  hats,  or  tildes  will  denote  models  based  upon  the  corresponding  data  interval. 
The  data  lengths  of  the  two  intervals  Yx  and  Y2  are  generally  different  with  the  first  data  interval 
determined  by  the  slow  adaptation  method  and  with  the  second  set  usually  much  shorter  and  of 
variable  data  length  since  the  best  data  length  for  detection  of  abrupt  changes  is  not  known.  For 
any  selection  of  the  two  intervals,  we  wish  to  determine  if  there  has  been  a  significant  departure 
in  the  process  characteristics  between  the  two  data  sets. 

Ideally,  the  hypothesis  (61,62)  that  the  models  are  different  over  the  two  subintervals  Yx  and 
Y  2  verses  the  hypothesis  6 ;  that  the  model  is  the  same  over  the  joint  data  set  (T1,lr2)  would  be 
compared.  But  this  would  involve  a  considerable  number  of  comparisons.  To  avoid  such  numerous 
comparisons,  consider  the  following  approximation.  Let  data  set  Yx  be  chosen  as  the  most  recent 
optimal  length  interval  preceding  Yz  with  corresponding  model  6j  which  provides  a  near  optimal 
prior  model  Yx.  To  detect  any  abrupt  changes  in  the  system,  consider  the  approximation  of  using 
the  model  61  as  an  approximation  to  the  joint  model  6*. 

As  discussed  in  Section  43,  consideration  can  be  limited  to  conditional  models  given  the 
pa«  or  equivalently  the  initial  state  at  the  beginning  of  the  subinterval.  Using  such  conditional 
models,  the  likelihood  function  reduces  to 

P(Ti.y2.6i)-F(Jfl.6l)P(lr2.61)  (5*12) 

The  model  on  the  first  data  set  Y:  is  the  same  tot  both  the  above  hypothesis  and  the  change 
hypothesis  p(Yx, The  entropy  measure  is  the  difference  of  the  above  two  log  likeli¬ 
hoods  which  involves  only  the  likelihoods  on  the  second  subinterval  Yz  so  that 


*(Mi)  -  f[‘ogp(K2,6l)  -  logp(K2.62)] 


(5-13) 


If  the  system  in  fact  had  an  abrupt  change  between  Yx  and  Y2,  then  since  the  data  length  Yx  is 
much  longer  thu  y2,  moat  of  the  information  in  detecting  the  abrupt  change  is  in  the  comparison 
of  the  two  models  and  62  on  the  data  Y2. 

The  problem  is  now  to  obtain  an  estimate  of  the  entropy  measure  (4*13)  from  the  observa¬ 
tional  data.  The  observed  log  likelihood  is  used  as  an  estimate  of  the  entropy  as  in  (3-7).  The  bias 
of  this  estimate  of  the  entropy  measure  is 

E[l(YJ  -  l(YJ)  -  £[/(§)  -  /(Kj)]  -  £(/(§)  -  l(Y<)) 

=  - dim( e2)  +  £(§,^2)  -  R(9,Yd  (5-14) 

where  the  term  dim(&)  in  (3-8)  is  not  present  since  the  estimate  81  is  a  function  only  of  the  sam¬ 
ple  Kj  which  is  conditionally  independent  of  the  sample  Y2.  Thus  an  unbiased  estimate  of  the 
difference  of  negentropies  R(9,Y£  -  R(b,Y{)  of  the  two  models  is 

l(Yd  -  1(Yt)  +  duntf)  (5-15) 

This  gives  a  test  for  the  occurrence  of  an  abrupt  change  between  the  two  data  intervals. 
Depending  upon  the  nature  of  the  change  and  the  process  characteristics,  the  best  detection 
interval  will  vary.  Some  changes  give  most  of  the  information  about  the  change  over  a  short 
interval  while  others  have  a  cumulative  effect  and  require  a  long  time  interval  to  detect. 

Consider  as  an  example  of  the  procedure  for  detecting  abrupt  changes  the  ARMA(4,3) 
model  (4-8).  Three  types  of  abrupt  changes  were  simulated  including  an  abrupt  change  in  the 
dynamics,  in  the  state,  and  in  the  variance  of  the  excitation  noise  wt.  The  results  of  the  pro¬ 
cedure  for  detecting  abrupt  changes  for  the  case  of  no  change  and  cases  of  a  simulated  abrupt 
change  in  the  dynamics,  the  excitation  noise,  and  the  state  are  shown  in  Tables  4,  5,  6,  and  7 
respectively.  In  each  case,  the  entropy  measure  for  detecting  the  abrupt  change  was  computed 
over  various  intervals  of  data  of  lengths  30,  100,  and  200  samples  and  the  abrupt  change  occurred 
at  the  sample  time  325.  In  the  case  of  no  abrupt  change,  the  entropy  measure  shown  in  Table  4 
is  on  the  average  0.5186  with  a  standard  deviation  of  0.016.  As  shown  in  Table  5,  the  largest 
value  of  the  entropy  measure  is  in  fact  in  the  interval  samples  300-330  containing  the  time  of  the 
abrupt  change  in  dynamics.  The  following  interval  of  samples  350-400  also  indicates  a  large  value 
of  the  entropy  measure.  The  initial  large  value  in  interval  300-350  is  due  to  a  transient  in  the 
state  as  it  settles  to  a  new  steady  state  variance  which  is  largely  complete  in  interval  350400.  The 
entropy  measure  then  persists  at  this  value  in  following  intervals.  The  abrupt  change  in  noise 
variance  shown  in  Table  6  has  a  different  character.  The  negentropy  changes  abruptly  in  interval 
300-350  and  remains  at  that  value  in  succeeding  intervals.  With  an  abrupt  change  in  the  state 


NUMBER  OF  POINTS  IN  DATA  SUBINTERVAL 


Table  4.  Value  of  Per  Sample  A1C  for  Subintervals  of  the  Sample  with  No  Abrupt  Change. 


Table  5.  Value  of  Per  Sample  AJC  for  Subintervals  of  the  Sample  with  an  Abrupt  Change  in 
Dynamics  at  Sample  325. 


shown  in  Table  7,  the  departure  is  largely  confined  to  the  interval  300-350  although  the  transient 
has  not  quite  died  out  in  the  interval  350-400. 

In  all  cases  there  was  no  assumption  as  to  the  nature  of  the  change,  and  the  procedure 
works  as  well  on  state  jumps,  changes  in  noise  variances,  or  other  changes.  Note  that  the  charac¬ 
ter  of  abrupt  change  is  quite  different  depending  on  the  type  of  abrupt  change  that  occurs.  The 
best  detection  of  abrupt  changes  can  only  be  achieved  by  an  adaptive  procedure  that  considers 
the  multitude  of  data  intervals  and  selects  a  near  optimal  data  length  for  detection.  These  initial 
results  on  the  adaptive  detection  procedure  demonstrate  that  it  is  very  sensitive  to  a  variety  of 
different  abrupt  changes  in  the  model. 


SAMPLE 

NUMBER  OF  POINTS  IN  DATA  SUBINTERVAL 

D 

100 

200 

200 

250 

300 

350 

400 

m 

0503 

0513 

n  CTO 

0.653 

0.653 

Uj/7 

0.654 

Table  6.  Value  of  Per  Sample  A1C  for  Subintervals  of  the  Sample  with  an  Abrupt  Change  in  Ex¬ 
citation  Noise  Variance  at  Sample  325. 


SAMPLE 

NUMBER  OF  POINTS  IN  DATA  SUBINTERVAL 

50 

100 

200 

2 «r 

250 

300 

350 

400 

0.495 

0503 

0513 

21342 

85.766 

43.180 

0595 

Table  7.  Value  of  Per  Sample  A1C  for  Subintervals  of  the  Sample  with  an  Abrupt  Change  in 
State  at  Sample  325. 


6.  SMALL  SAMPLE  MULTIVARIATE  ANALYSIS 


The  approach  to  «mall  sample  inference  in  this  Chapter  is  the  use  of  the  entropy  measure  of 
model  approximation  error  to  evaluate  the  performance  of  small  sample  methods.  This  general 
approach  is  based  on  the  justification  of  entropy  as  the  natural  measure  of  model  approximation 
error  as  developed  in  Chapter  2.  The  historical  Bayesian  predictive  inference  approach  plays  a 
major  role  in  providing  a  computable  predictive  density  which  is  subsequently  shown  to  be 
optimal  in  terms  of  the  entropy  measure.  This  optimality  is  established  by  considering  best  invari¬ 
ant  predictive  densities.  For  the  multivariate  normal  family,  several  predictive  densities  are  com¬ 
pared  with  the  best  invariant  to  show  the  large  improvements  that  are  possible  in  small  samples. 

(.1  Baysrfaa  Predictive  Inference 

The  historical  approach  to  predictive  inference  involves  the  use  of  Bayesian  concepts  and 
methods  to  determine  the  predictive  density.  Consider  the  parameterized  class  of  probability  den¬ 
sity  functions 

F  =  {p(yj |  0)*ce}  (6-1) 

defined  on  the  joint  sample  space  (X,f).  The  predictive  inference  setup  as  in  Section  2.1  is  con¬ 
sidered  where  a  predictive  density  pa(y|  x)  is  to  be  constructed  as  an  approximation  to  the  true 
conditional  density  p(y|  x,0)  for  the  unknown  parameter  value  0.  In  the  Bayesian  approach,  the 
predictive  density  is  constructed  on  the  basis  of  an  assumed  prior  density  p(0)  on  the  parameters  0 
using  Bayes  rule. 

From  Bayes  Theorem,  the  posterior  density  of  the  parameters  is  given  by 


p(«l*) 


P(^)p(x\  0) 


where  the  marginal  density  of  x  is 


P(x)  ■  Jp(8)p(*I  8>*8 


The  Bayesian  predictive  density  pb{y\  x)  is  then  given  by  computing  the  marginal  density  using 
the  posterior  so 

PaO-l  *)  -  Jp(yl  x,0>(0|x)d0  (6-i) 

This  approach  is  direct  and  simple  although  the  assumption  of  a  prior  density  p(0)  on  the 
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paramctcri  9  is  bothersome  from  both  a  theoretical  and  practical  point  of  view. 

A  major  objection  to  the  Bayesian  approach  is  the  use  of  an  arbitrary  prior  density  on  the 
parameters  to  express  ignorance.  If  a  uniform  density  is  used  for  the  prior  on  6,  then  a  nonsingu¬ 
lar  transformation  of  the  perimeters  to  a  new  set  6  -  g( 9)  and  use  of  a  uniform  prior  on  6  in 
general  produces  a  different  posterior  density.  Thus  there  is  a  certain  arbitrary  choice  of  the 
parameterization  and  resulting  posterior.  A  way  around  this  is  the  use  of  nonutformtuivt  prior 
densities.  Such  densities  give  posterior  densities  that  are  invariant  to  transformations  of  the 
parameter  space  (Jeffreys,  1961;  Box  and  Tiao,  1973).  In  situations  where  a  noninformative  prior 
exists,  it  can  be  obtained  in  terms  of  the  Fisher  Information  matrix. 

In  recent  years,  some  intriguing  connections  between  the  bayesian  predictive  density  and 
concepts  of  entropy,  frequentist  methods,  invariant  methods,  and  noninformative  priors  have 
been  made.  Still  in  a  strictly  Bayesian  context,  consider  the  negentropy  measure  (2-6)  applied  to 
the  Bayesian  predictive  density  (64).  Of  course  a  Bayesian  would  take  expectation  of  this  meas¬ 
ure  with  respect  to  the  unknown  parameters  9  which  will  be  called  the  Bayts  Bisk.  Using  (64)  and 
interchanging  the  order  of  integration,  the  Bayes  Risk  between  any  two  predictive  densities 
Pi(y I  *)  andp2(yj  x)  is 

^,.*Oi(yl  -O  ; P2Cvl  *))  *  JpWJp(* I  »)/p(y| *,e)  lQg^j dy 

,  ,  p,(y|x) 

=  fp(x)dxfpt,(yl  X)  leg-  dy  (6-5) 

Now  setting  P)(y|  x)  »  pb(y\ x)  guarantees  that  the  Bayes  Risk  between  pb  and  any  predictive 
density  p2  “  nonnegative  and  zero  if  and  only  if  p2(y| x)  *  pb(y\ x). 

Thus  in  a  Bayesian  context,  the  Bayesian  predictive  density  is  optimal  in  terms  of  the  Bayes 
Risk,  i st.  the  expected  negative  entropy  measure  with  the  expectation  also  taken  over  the  param¬ 
eters  9.  As  was  noted  by  Aitchison  in  the  original  derivation  of  this  result  for  the  case  of  x  and  y 
independent,  there  are  a  number  of  interesting  cases  where  the  negative  entropy  measure,  ic.  the 
Bayes  Risk  excluding  expectation  over  the  parameters  9,  is  not  a  function  of  the  parameters  9.  In 
such  cases  the  Bayesian  predictive  density  is  optimal  in  a  frequency  sampling  sense  where  there  is 
a  fixed  unknown  true  parameter  value  and  the  negative  entropy  (2-6)  is  used  as  the  measure  of 
error.  This  topic  is  discussed  in  the  next  section. 


6-3 


In  several  particular  cases,  the  optimal  predictive  density  has  been  found  that  minimis  the 
negative  entropy.  Murray  (1977)  considers  the  claa  of  d-dimensional  multivariate  normal  densi¬ 
ties  with 

p(y\ *2)  -  (2ir)^|  2|  -Wexp{-|(y-^r,(y-^)}  (66) 

In  this  case  Aitchison  and  Ounsmore  (1973,  p.  29)  show  that  using  the  nooinformative  prior  den¬ 
sity  proportional  to  |  Z| _1,  the  Bayesian  predictive  density  is  the  d-dimensional  Student  distribu¬ 
tion 


P,(yl*)  -5^h-l,»i,(a+lX«-irli] 


(6-7) 


where  the  d-dimensional  vector  z  is  Std(kJ>,c)  if  it  has  density  function 


p(z) _ CKLtlM _ 

^{(k^d  +1V2}|  kc\  ~v2{l+{x-bf(kc)-1(x-b)){k  +1V2 


(66) 


This  Bayes  predictive  density  was  shown  (Aitchison,  1975)  to  result  in  the  negative  entropy  not  a 
function  of  the  unknown  parameter  6..  It  is  thus  also  optimal  in  the  frequency  sense. 

This  same  result  was  derived  by  Murray  (1977)  using  invariance  concepts.  Consider  the  class 
G  of  invariant  predictive  densities  pa(yj  z)  that  are  invariant  to  translations  and  linear  transforma¬ 
tions  of  the  sample  x.  In  this  class,  the  best  invariant  predictive  density  p,(y\  x)  was  shown  to  be 
(6-7)  which  gives  a  constant  value  of  the  negative  entropy  independent  of  the  value  of  the  true 
parameter  value  6..  This  gives  a  strictly  frequentist  interpretation  of  the  Bayes  predictive  density. 
A  stronger  result  reported  very  recently  (Levy  and  Perng,  1986)  is  the  minimality  the  negentropy 
for  the  best  invariant  predictive  density  p/(y J  x)  uniformly  in  the  unknown  parameters 
0.  *  (p.,2.)  among  any  predictive  density  in  the  class  G  of  invariant  predictive  densities. 

(J  Campari—  of  Entropy  for  Multivariate  Normal 

To  illustrate  the  usefulness  of  the  predictive  inference  approach  using  negentropy,  the 
results  for  the  multivariate  normal  distribution  are  given  below  in  terms  of  the  relative  odds  of 
the  likelihood  ratio.  Consider  the  case  (6-6)  of  the  multivariate  normal  density  Nd(tk£).  Here 
three  methods  are  compared:  the  estimative  method  using  the  predictive  density 
Pe(yi x )  ”  ^(y >£(*)’£(*))  where  the  maximum  likelihood  estimates  (i(x)  and  £(x)  are  used,  the 
best  normal  with  estimates  (ji,(n +  1X"-* -2)‘ll)  which  minimize  the  negentropy  in  the  class  of 
normal  densities,  and  the  Bayesian  predictive  density  which  is  identical  to  the  best  invariant 


> 
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predictive  density. 

The  expected  negentropies  ere  shown  in  Tebie  8  for  the  above  three  predictive  densities. 


Predictive 

Number  of  Observations  (n) 

Densities 

4 

6 

11 

14 

20 

SO 

1-Dimensional 

Estimative 

1.191 

0366 

0.130 

0.094 

0.060 

0.021 

(2-48) 

(1.170) 

(1.037) 

(1.121) 

(1.009) 

(1.001) 

Best 

0.476 

0226 

0.103 

0.079 

0.053 

0820 

Normal 

(1-21) 

(1.048) 

(1.009) 

(1.006) 

(1.002) 

(1.000) 

Best 

0282 

0.180 

0.094 

0.083 

0.051 

0.020 

Invariant 

8-Dimensional 

Estimative 

- 

- 

3687 

8.08 

285 

0.61 

(  w14 ) 

(221.4) 

(3819) 

(1127) 

Best 

- 

• 

6.79 

3.15 

1.63 

030 

Normal 

(8.93) 

(136) 

(1127) 

(1.010) 

Best 

- 

- 

4.60 

2.69 

131 

0.49 

Invariant 

Table  8.  Expected  Negative  Entropy  (and  the  Geometric  Mean  of  the  Likelihood  Odds  Relative 
to  the  Best  Invariant). 

In  compering  two  predictive  distributions,  the  relevant  quantity  is  the  difference  between  their 
negentropies  (2-5)  (Larimore  (1983a)).  The  exponential  of  this  quantity  is  the  geometric  mean  of 
the  relative  odds  of  a  sample  y  having  come  from  the  two  respective  predictive  distributions. 
This  exponential  of  the  negentropy  difference  is  also  given  in  parentheses  in  Table  8  for  the  best 
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normal  and  estimative  methods  relative  to  the  best  invariant  method.  Since  ap(a-b)  =  l+(a-6) 
for  o-h«l,  we  see  that  for  negentropy  differences  much  less  than  unity,  the  odds  of  an 
observed  d -dimensional  sample  y  coming  from  either  of  two  predictive  distributions  is  about 
equal.  For  the  negentropy  difference  near  unity,  these  odds  are  disproportionate  of  order  e  =2.7; 
and  if  it  is  much  greater  than  unit  the  odds  can  get  very  large.  Note  that  a  20  percent  increase  in 
the  negentropy  as  between  the  estimative  and  best  invariant  for  d  =1  and  n  =20  has  only  a  one 
percent  odds  advantage.  On  the  other  hand,  a  17  percent  increase  in  the  negentropy  as  between 
the  best  normal  and  best  invariant  for  d  =8  and  n  =14  has  an  odds  ratio  of  1.56.  This  emphasizes 
the  importance  of  comparing  the  negentropy  on  the  basis  of  the  arithmetic  difference  and  not  the 
relative  proportion.  Note  that  for  very  small  samples  the  relative  odds  ratio  can  be  much  larger 
than  unity  and  even  in  the  tens  or  hundreds.  Thus  there  is  a  huge  potential  gain  in  the  use  of 
predictive  inference  in  very  small  samples  as  has  been  noted  in  different  terms  by  Aitchison  and 
Dunsmore  (1975,  p.  231),  Aitchison  and  Kay  (1975)  and  Murray  (1979). 


7.  CONCLUSIONS  AND  RECOMMENDATIONS 


7J  C— cM— i  From  Phaa  I  Study 

In  this  Phase  I  SBER  study,  sutistical  methods  are  developed  using  predictive  inference  and 
entropy.  This  approach  has  a  strong  intuitive  appeal  as  a  result  of  the  justification  of  the  entropy 
measure  based  upon  the  predictive  inference  framework  and  the  fundamental  statistical  principles 
of  sufficiency  and  repeated  sampling.  This  approach  applies  to  a  wide  class  of  inference  problems 
including: 

•  general  inference  methods  such  as  parametric  or  nonparametric 
methods 

•  exact  evaluation  of  small  sample  procedures 

a  determination  of  model  order  or  structure  including  the  case  of  non¬ 
nested  multiple  comparison 

•  time  aeries  analysis  including  definition  of  optimal  tracking  of  time 
varying  processes  and  optimal  detection  of  abrupt  changes. 

The  entropy  measure  provides  a  fundamental  measure  for  the  comparison  of  alternative  sutistical 

procedures  and  provides  a  basis  for  developing  optimal  sutistical  inference  methods. 

In  this  study  a  number  of  particular  topics  were  addressed  from  the  predictive  inference  and 
entropy  perspective  including: 

•  sutistical  model  building  involving  the  determination  of  parametric 
model  structure  and  order  in  the  general  case  of  constrained  multi¬ 
ple  nonnested  alternatives, 

•  time  series  modeling  and  forecasting  involving  the  determination  of 
parametric  model  structure  and  order, 

•  adaptive  time  series  analysis  involving  optimal  methods  for  tracking 
slow  changes  as  well  as  for  detecting  abrupt  changes  or  failures, 

•  small  sample  inference  for  multivariate  distributions  of  the  exponen¬ 
tial  family. 

Some  major  results  were  developed  on  these  topics  that  demonstrate  the  feasibility  and  desirabil¬ 
ity  of  developing  sutistical  methods  using  predictive  inference  and  entropy. 

A  number  of  results  were  obtained  for  the  nonnested  multiple  comparison  problem  based 
upon  the  study  of  constrained  maximum  likelihood  estimates.  These  include: 


•  consideration  of  the  general  constrained  cue 

•  general  extension  of  Akaike's  AIC  procedure  to  constrained  non* 
nested  multiple  comparison  problems 

•  solution  of  the  general  constrained  case  requires  that  a  condition  on 
the  Fisher  information  and  Hestian  matrices  be  satisfied 

•  a  general  model  order  and  structure  selection  method  was  shown  to 
be  asymptotically  optimal. 

Previous  developments  consider  only  the  case  where  the  true  parameter  is  approached  asymptoti¬ 
cally  and  exclude  the  case  where  the  true  parameter  lies  outside  the  models  considered.  The  con¬ 
strained  case  investigated  in  this  study  gives  a  basis  for  viewing  the  predictive  inference  and 
entropy  method  u  model  approximation  when  the  models  are  restricted  and  asymptotically 
biased.  These  results  provide  a  basis  for  the  use  of  predictive  inference  and  entropy  on  the  gen¬ 
eral  time  series  analysis  and  adaptive  time  aeries  analysis  problems  involving  constrained  non¬ 
nested  multiple  comparison. 

Using  currently  available  methods,  the  time  series  analysis  problem  is  difficult  because  the 
parametric  model  structure  is  unknown  and  requires  the  fitting  and  comparison  of  many  different 
models.  Also  current  methods  are  numerically  and  statistically  ilkonditioned  for  some  models. 
The  approach  of  predictive  inference  and  entropy  provides  a  natural  solution  to  the  multiple  com¬ 
parison  problem.  The  results  obtained  using  the  predictive  inference  and  entropy  approach  for 
multivariate  time  series  analysis  include: 

•  Explicit  expressions  for  a  lower  bound  on  the  achievable  accuracy  in 
the  estimation  of  the  transfer  function  and  power  spectral  matrix 

•  This  lower  bound  applies  to  the  case  where  the  true  model  order  is 
unknown  and  a  model  order  determination  procedure  is  used. 

•  The  lower  bound  is  achieved  for  large  samples  using  maximum  likel¬ 
ihood  estimation  and  the  AIC  order  determination  procedure. 

An  example  of  the  estimation  accuracy  of  a  true  ARMA(43)  process  using  spectral  smooching, 

AR  model*  and  ARMA  model  fitting  show  the  considerable  difference  in  using  these  various 

methods. 

Markov  models  of  time  series  were  developed  as  a  basis  for  stable  time  series  analysis 
methods  using  the  canonical  variate  method  (CVA).  This  method  is  numerically  and  statistically 
stable  and  has  been  applied  recently  to  a  number  of  high  order  multivariable  time  series  analysis 
problems.  This  approach  provides  the  basis  for  adaptive  time  series  analysis  methods. 

Markov  models  of  time  series  with  changing  characteristics  were  developed  including  slowly 
and  abruptly  changing  processes.  Using  the  CVA  method  as  the  computational  method,  the 
entropy  methods  were  applied  to  adaptive  time  series  analysis: 
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•  StttMcil  method*  for  determining  the  optimal  data  length  for  adap¬ 
tation  to  slow  change*  were  developed. 

•  gutiatical  method*  for  choosing  the  optimal  time  interval  for  detec¬ 
tion  of  an  abrupt  change  in  the  proce**  were  developed. 

•  The  entropy  measure  is  optimally  sensitive  to  any  abrupt  changes 

changes  in  process  dynamics,  changes  in  the  excitation 
noise  levels,  and  jumps  in  the  process  state. 

These  results  demonstrate  the  feasibility  of  developing  adaptive  time  series  analysis  methods 
based  upon  entropy  methods  and  the  CVA  computations.  The  CVA  computations  have  been 
demonstrated  in  real  time  identification  of  multivariable  systems.  Thus  the  feasibility  of  adaptive 
time  series  analysis  in  real  time  has  been  demonstrated. 

Small  sample  methods  were  developed  using  the  predictive  inference  and  entropy  methods. 
The  justification  of  entropy  based  upon  the  sufficiency  and  repeated  sampling  principles  provides 
a  sound  justification  for  the  use  of  recently  developed  small  sample  methods  based  upon  entropy. 
The  Bayesian  method  was  extended  to  the  case  where  the  informative  and  predictive  experiments 
are  dependent.  The  theory  is  illustrated  for  the  multivariate  normal  distribution  using  the  Baye¬ 
sian,  best  invariant,  estimative,  and  best  normal  predictive  densities.  The  relative  measure  of 
approximation  is  shown  to  be  the  per  sample  relative  odds  ratio  which  is  the  exponential  of  the 
entropy  measure. 

7 2  Psntmmmif arise*  for  Author  Remarch  and  Development 

This  study  has  demonstrated  the  feasibility  and  usefulness  of  predictive  inference  and 
entropy  methods  particularly  in  the  areas  of: 

•  constrained  nonnested  multiple  comparison  of  models 

•  model  order  and  structure  determination  for  time  series 

o  modeling  of  changing  processes  using  Markov  model  structures 

•  optimal  adaptation  to  slowly  varying  processes  by  optimal  selection 
of  data  interval 

•  optimal  adaptation  to  abrupt  changes  of  unknown  type  at  unknown 
times  by  optimal  selection  of  the  detection  data  interval 

•  automatic  stable  computation  of  time  series  models  using  the  CVA 
method 

•  determination  of  lower  bounds  for  the  estimation  of  transfer  func¬ 
tions  and  power  spectra 

These  achievements  provide  a  bases  for  the  further  research  and  development  of  predictive  infer¬ 
ence  and  entropy  methods. 
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The  areai  of  greatest  promise  appear  to  be  those  of  adaptive  end  a oa adaptive  time  series 
analysis  for  the  following  reasons: 

•  the  number  of  potential  applications  to  DoD  systems  is  very  large 

e  time  series  analysis  and  adaptation  are  the  major  problems  in  adap¬ 

tive  control 

e  to  address  the  adaptive  time  series  analysis  problem  requires  an 
approach  that  deals  with  the  multiple  comparison  problem  in  a  fun¬ 
damental  way  that  is  offered  by  predictive  inference  and  entropy 
methods 

e  among  the  current  time  series  analysis  methods,  only  the  CVA 
method  is  suitable  for  real  time  solution  of  the  problem 

•  present  and  near  future  computers  are  capable  of  multivariable  iden¬ 
tification  in  leas  than  a  second  of  computation  for  high  order  systems 
of  doaens  of  states 

These  methods  have  been  demonstrated  to  be  feasible,  and  the  development  of  online  adaptive 
time  series  analysis  software  for  general  application  would  provide  an  enormous  capability  for 
DoD  systems.  Presently  there  are  no  other  known  approaches  that  will  achieve  this  goal.  Such 
adaptive  time  series  processors  would  allow  for  the  adaptation  of  systems  to  slow  and  abrupt 
changes  in  the  environment. 

The  topics  recommended  for  further  research  and  development  include: 

•  further  research  and  development  on  the  adaptive  time  series 
analysis  methods  for  adaptation  to  slow  and  abrupt  changes 

•  development  of  algorithms  for  implementation  of  the  adaptive 
methods  that  are  numerically  stable  and  accurate  and  are  statisti¬ 
cally  reliable 

•  prototype  algorithm  testing  to  demonstrate  the  accuracy,  reliability, 
and  computational  requirements  on  typical  DoD  problems. 

•  software  development  to  provide  modular,  documented,  and  verified 
software  in  one  or  more  general  programming  languages. 

The  achievement  of  these  objectives  would  provide  a  dramatic  improvement  in  the  performance 

of  adaptive  methods  and  the  availability  of  software  for  adaptation  in  DoD  systems. 
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1. 


The  federal  problem  of  choosing  a  model  from  among  a  multitude  of  alternative 
models  remains  one  of  the  difficult  problem  of  statistical  inference.  Traditional  methods 
of  statistical  hypothesis  testing  apply  directly  only  to  the  case  where  there  are  two 
hypotheses  under  consideration  and  one  is  a  subset  of  the  other,  i.e.,  the  two  hypotheses 
are  nested.  In  such  cases,  classical  methods  are  applicable  and  lead  to  well  understood 
results.  In  the  case  were  there  are  more  than  two  hypotheses  involved,  the  use  of  the 
classical  methods  are  not  well  defined  or  understood  even  in  the  nested  case  (see  for 
example  the  discussion  in  Anderson,  1971,  pp.  270).  Although  the  probability  of  rejecting 
one  hypothesis  in  comparing  any  pair  is  well  defined,  the  repeated  application  of  such 
pairwise  comparisons  results  in  a  test  whose  properties  are  not  understood.  In  the  case  of 
comparing  two  hypotheses  that  are  not  nested,  the  distribution  theory  is  available  but 
much  more  complicated  (Larimore,  1977). 

Beyond  these  difficulties  in  carrying  out  the  classical  procedures  is  the  lack  of  a  gen¬ 
eral  framework  for  formulating  and  solving  the  problem  of  nonnested  multiple  comparison 
of  constrained  models.  The  predictive  inference  approach  offers  a  predictive  measure  of 
the  accuracy  of  various  model  selection  procedures  that  apply  as  easily  to  the  case  of  non¬ 
nested  multiple  comparison.  The  adequacy  of  a  model  selection  procedure  is  measured  in 
terms  of  the  accuracy  of  the  selected  models  in  hypothetical  repeated  'future*  experi¬ 
ments.  This  is  very  attractive  in  the  context  of  scientific  inference  were  the  role  of  model 
building  is  to  provide  a  basis  for  prediction  of  the  future  behavior  of  a  phenomenon.  The 
entropy  measure  provides  a  most  sensitive  measure  based  upon  the  sufficient  statistic  as 
contained  in  the  likelihood  ratio.  The  derivation  of  the  entropy  as  a  measure  of  the  pred¬ 
iction  error  of  a  predictive  density  in  the  predictive  inference  framework  is  based  upon 
the  fundamental  principles  of  sufficiency  and  repeated  sampling  (Larimore,  1983).  This 
provides  a  strong  theoretical  basis  for  the  use  of  entropy  in  predictive  settings  of  scientific 
inference.  In  more  narrowly  defined  problems  of  quality  control  or  decision  theory 
involving  a  well  defined  lorn  function,  ocher  procedures  may  be  more  appropriate.  But  in 
a  predictive  scientific  setting,  the  approach  using  predictive  inference  and  entropy  seems 
much  more  justified. 

Consider  the  problem  of  choosing  among  a  multitude  of  model  structures  on  the 
basis  of  a  set  of  observations.  If  we  adopt  the  predictive  criterion  that  the  chosen  model 
should  be  the  best  in  a  predicave  sense  in  predicting  another  independent  sample  from 
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the  «*«"*  process,  then  the  optimal  choice  is  the  model  selection  procedure  with  the 
minimum  negentropy.  The  major  problem  is  the  practical  evaluation  of  the  negentropy 
measure  and  the  determination  of  efficient  procedures  that  come  close  to  minimising  the 
negentropy  measure. 

In  this  paper,  the  theory  of  inference  for  nonnested  multiple  comparison  is 
developed  in  the  context  of  predictive  inference  and  entropy.  To  develop  a  general 
theory,  the  case  of  maximum  likelihood  is  considered  for  moderate  and  large  samples. 
The  case  where  the  true  process  model  is  not  contained  in  the  models  considered  for 
inference  is  the  usual  situation  in  scientific  inference  since  even  the  most  general  model 
forms  usually  do  not  include  certain  complexities  such  as  nonlinearities,  nonstationarity, 
etc.,  that  may  have  a  small  effect  or  be  very  difficult  to  handle.  Previous  approaches 
involving  the  entropy  measure  have  not  explicit  included  this  misa-modeling.  It  is  shown 
that  this  miss-modeling  can  be  directly  considered  in  the  analysis.  The  resulting  theory  is 
very  attractive  in  that  is  gives  an  explicit  interpretation  of  the  predictive  inference 
approach  as  model  approximation  of  the  true  process  using  simplified  alternative  model 
forms.  The  entropy  methods  lead  to  procedures  that  select  models  that  in  the  predictive 
sense  are  the  most  accurate  in  approximating  the  true  process  model.  The  classical  diffi¬ 
culties  of  nonnested  and  multiple  comparison  do  not  arise  in  this  predictive  inference  set¬ 
ting. 

In  the  paper,  first  the  subject  of  constrained  maximum  likelihood  estimation  is 
developed  in  the  predictive  inference  and  entropy  context.  This  is  used  to  derive  the 
expected  negentropy  for  maximum  likelihood  estimates,  and  then  to  determine  an 
unbiased  estimate  of  the  entropy.  Finally,  bounds  on  the  achievable  accuracy  of  model 
selection  procedures  is  derived  that  depend  on  the  number  of  estimated  parameters  in  the 
model  fitting. 


2.  CenatnrisMd  Mstlnsn  tjwuiihih  Estimation 

In  this  section,  properties  of  the  maximum  likelihood  parameter  estimates  are 
developed  for  the  case  that  the  true  probability  model  is  not  contained  in  the  class  of 
parameterized  densities  that  are  considered  for  inference.  The  classical  development  of 
the  asymptotic  consistency  and  minimum  variance  of  maximum  likelihood  estimators  is 
for  the  case  where  the  true  density  is  contained  in  the  parametric  class. 
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The  predictive  inference  framework  as  in  Larimore  (1983)  is  adopted  here  with 
p(x,9)  the  parameterized  probability  density  where  0  is  a  vector  of  parameters,  x  is  the 
informative  sample  and  y  is  the  predictive  sample.  Suppose  that  the  parameter  vector 
er«(01,02,...)  is  a  finite  or  infinite  set  of  parameters,  and  for  each  subset  of  distinct  posi¬ 
tive  integers  *=(*!,...  ,kM )  consider  the  subspace  8t  of  0  such  that  only  the  correspond- 
ing  0t , . . .  .0*  are  nonzero  where  0*  denotes  a  member  of  8*.  and  let  Ck  be  the  the 
class  of  models  Cj«{p(x,0*),0*€8t}.  These  classes  of  models  are  in  general  nonnested  so 
that  we  do  not  in  general  have  C*CC,  or  C,  CC4 .  The  maximum  likelihood  estimator  for 
the  clam  Ck  will  be  denoted  as  0*(x). 

The  development  of  the  maximum  likelihood  theory  is  straight  forward  for  the  case 
where  Taylor  series  expansions  are  possible.  This  holds  under  the  following  regularity 
conditions  (Cox  and  Hinkley,  p.  281): 

(i)  The  parameter  space  is  closed  and  compact. 

(ii)  The  probability  distributions  defined  by  any  two  different  values  of  0  are  distinct. 

(iii)  The  first  three  derivatives  of  the  log  likelihood  /(x,0)  with  respect  to  0  exists  in  the 
neighborhood  of  the  true  parameter  value  almost  surely.  Further,  in  such  a  neigh¬ 
borhood,  n-1  times  the  absolute  value  of  the  third  derivative  is  bounded  above  by  a 
function  of  x ,  whose  expectation  exists. 

In  particular,  these  conditions  permit  the  interchange  of  expectation  and  differentiation 
up  to  second  order. 

In  the  discussion  various  order  models  are  considered,  and  the  relationships 
between  the  various  orders  is  developed.  The  log  likelihood  function  of  the  informative 
sample  x  will  be  denoted  by  /(x,0),  and  the  gradient  row  vector  and  Hessian  matrix 
denoted  / ‘(x  ,0)  and  /”(x,0)  respectively.  Expectation  ,  denoted  E,  will  be  with  respect  to 
the  true  density  p(x,9)  unless  stated  otherwise  where  0  denotes  the  true  parameter  value. 
Define  the  projection  0*  of  0  onto  8t  as  the  parameters  0*  €8*  minimizing  the  negentropy 
Rm  relative  to  the  informative  sample  x 

*,(0,0*)  -  E/(x,0)  -  El(x ,0*)  (2-1) 

At  the  minimum  0* ,  the  gradient  of  (2-1)  is  zero  so  from  the  regularity  conditions 


Er(x,0*)«O, 


(2-2) 
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and  the  minimum  is  unique  if  and  only  if  the  expected  Hessian,  denoted  £)*,  of  (2*1)  is 
positive  definite  in  an  open  neighborhood  of  the  minimum.  From  the  regularity  condi¬ 
tions,  die  Hessian  is  given  by  D*=£/"(x, 8*). 

To  determine  the  moments  of  the  maximum  likelihood  estimates  0*,  consider  the 
first  order  equality 

o  =  /•(*,§*)  =  /'(x,e*)  +  (2-3) 

Taking  expectation  with  respect  to  the  the  true  density  and  using  (2-2)  gives  the  equation 

D*(£0*-0*)  -  0  (2-4) 

that  holds  asymptotically  for  large  informative  sample  N .  For  0*  identifiable,  i.e.  0* 
unique,  D*  is  noosingular  which  implies  that  to  first  order 

£9*  -  0*  (2-5) 

Now  using  (2-3),  the  covariance  of  the  estimation  error  is 

£(0*-0tXe*-«t)r  =(DtV£{/'7(x,0»)r(x,0*)Kf)4V  (2-6) 

Note  that  in  the  unconstrained  case,  the  middle  term  which  is  the  Fisher  information 
matrix  is  equal  to  minus  the  expected  Hessian  £>*,  but  this  is  not  in  general  true  for  the 
constrained  case. 

3.  Expected  Negative  Entropy  for  Maximum  Likelihood  Estimates 

To  evaluate  the  expected  negative  entropy  for  maximum  Likelihood  estimates,  con¬ 
sider  the  predictive  inference  setup  as  in  Larimore  (1983).  The  general  case  of  depen¬ 
dence  between  the  informative  and  predictive  samples  is  considered.  The  expected  nega¬ 
tive  entropy  is  a  measure  of  the  degree  of  approximation  of  the  true  conditional  density 
P(y\  *.®)  by  the  predictive  density  p[y\  x,0*)  for  predicting  the  future  predictive  sample  y 
from  the  informative  sample  x.  The  expectation  will  be  taken  in  two  steps,  first  with 
respect  to  the  random  variable  y|  x  and  then  with  respect  to  x  In  this  section  the  likeli¬ 
hood  function  /(9*)  *  /(y|  x,0*(x))  is  for  the  predictive  sample  yjx,  and  the  maximum 
likelihood  estimator  0*(x)  is  on  the  informative  sample  x. 

Expanding  (2-1)  in  a  Taylor  series  gives  a  second  order  expression  for  the  informa¬ 
tion  distance  which  holds  asymptotically  for  large  sample  size  of  the  informative  sample, 


m*  - -I— — —  — ■—  * 
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i.c.  for  the  likelihood  estimate  0*  close  to  the  projection  0* 

,(M*(*))  *  £[/(0*)  -  /(o*)]  +  £[/(§)  -  /(§*)] 

-  -^[r(?X?  -  »*)]  -  -  0*)]  +  £[/(§)  -  /( o*)] 

--j£J|0*(x)-0a  lli  +^,(0,0*)  (3-1) 

since  0*(-r)  is  independent  of  /(y|  x,0*)  and  using  the  gradient  property  of  the  projection 
(2-2).  The  second  order  expansion  is  only  locally  in  the  estimation  error  0*-0*  about  the 
projection  0*  of  the  true  parameter  value  0  on  the  subspace  of  the  parameters  correspond¬ 
ing  to  the  model  O*.  Note  that  this  exprewon  gives  the  exact  bias  term  *^,(0,0*)  in  small 
samples  and  involves  no  approximation. 

4.  Unbiased  Estimation  of  Entropy 

For  decision  on  model  parametric  order  and  structure,  it  is  necessary  to  estimate 
the  negative  entropy  based  on  the  informative  sample.  One  such  procedure  is  due  to 
Akaike  (1973).  We  consider  the  case  where  the  informative  sample  x  and  the  predictive 

sample  y  are  independent.  For  each  selection  of  a  parameter  subset  k=(k  j . km),  the 

Akaike  information  criterion  for  comparing  the  maximum  likelihood  estimators  is 

A/C(*)  =  -21og p(x,P(x))  +  2 K(k)  (4-1) 

where  K(k)  is  the  number  of  parameters,  i.e.  the  dimension  of  0* .  The  minimum  AJC 
estimator  (MAICE),  denoted  9A(x),  is  dA(x)*&^\x)  where  k(x)  is  the  parameter  set 
minimizing  AlC(k).  The  A1C (k)  is  an  unbiased  estimator  of  the  negative  entropy  based 
upon  the  informative  sample  and  the  assumed  model  structure.  The  predictive  sample  is 
esKntially  replaced  by  the  informative  sample,  and  the  term  2K(k)  is  an  adjustment  for 
the  bias  due  to  the  correlation  between  the  informative  sample  x  and  the  estimate  0*(x). 

Following  Akaike,  we  use  the  maximized  log  likelihood  /,(0*)  -  /(x,0*(x))  as  an 
estimate  of  the  relative  entropy  and  compute  the  bias  in  the  procedure.  We  expand  the 
log  likelihood  function  as  in  (3-1)  except  that  below  the  likelihood  is  on  the  informative 
sample  so  that  there  is  dependence  between  l(x,¥)  and  0*(x).  In  particular,  using  (2-3) 

gives 
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-  £[/'(9*X«*  -  «*)]  -  £[(»*  -  -  e*)]  (4-2) 

Consider  expending  the  expected  log  likelihood  difference  as  in  (3-1)  but  using  (4-2)  as  a 
result  at  the  dependence.  Thus 

f  (/(«)  -  /(6*)]  -  E [/(§*)  -  /( 8*)]  +  £[/(0)  -  /(§*)] 

-  -E[/'(0*Xe*  -  §*)]  -  E[y(«*  -  Vfrfry?  -  0*)]  +  £[/(§)  -  /(©*)] 

-  £[(9*  -  e*)r/*  (®*xe^  -  e*)]  +  £(0,0*) 

-  -tr(D%)~lE{l  T(x ,e*V  (x ,9*)}  +  £(0,9*)  (4-3) 

where  the  third  equality  follows  using  the  expression  (3-1)  for  the  expected  negentropy. 
In  the  general  case,  the  bias  term  in  the  negentropy,  £(0,0*),  is  correctly  estimated  except 
for  the  trace  term.  Asymptotically,  if  9*  approaches  0,  then  the  matrix  of  the  trace  is  the 
identity.  This  is  the  case  considered  by  Akaike  (1973).  In  the  general  constrained  case, 
this  may  not  be  the  case  so  that  the  bias  depends  upon  the  unknown  true  parameter  value. 
What  is  required  is  a  restriction  such  as  the  Hessian  and  Fisher  information  matrix  being 
equal  globally  which  gives 

-tr(DfrlE{lT(x,V)l{x&))  =  trIK{k)  =  K(k)  (4-4) 

Consider  the  case  of  fitting  two  models  9*  and  ,  and  consider  the  expected  differ¬ 
ence  of  the  maximized  log  likelihoods 

£(/(9»)  -  l(V))  -  £(/(§)  -  /(6 ')]  -  Elm  -  /( 6*)] 

=•  +dwi(9*)-</im(9/) +£(0,0/)-£(0,0*)  (4-5) 

Thus  for  relative  comparisons  among  hypotheses  based  on  a  given  sample,  an  unbiased 
estimate  of  twice  the  negentropy  £[/(§)  -  /(0*)]  is  given  by  the  Akaike  information  cri¬ 
terion.  Note  that  the  proof  of  this  is  much  more  general  than  that  originally  given  by 
Akaike  (1973)  since  it  applies  to  the  general  case  of  comparisons  of  nonnested  structures. 
Also,  the  true  parameter  9  need  not  be  contained  in  the  structures  being  compared  so 
long  as  the  Fisher  information  matrix  is  a  constant  in  a  neighborhood  including  the  true 
parameter  and  its  projection  onto  the  subspaces  of  these  structures. 
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5.  Benads  m  tbs  Accuracy  of  Modd  Selection 

To  obtain  a  correction  for  the  bias  in  the  sample  log  likelihood  as  an  estimate  of  the 
entropy  measure,  the  Fisher  information  must  be  constant  or  other  restrictions  are 
required.  In  this  section  the  Fisher  information  is  assumed  to  be  constant  in  a  neighbor* 
hood  of  the  true  parameter  0  containing  the  projection  0* .  Under  these  conditions  a 
lower  bound  on  the  entropy  measure  is  derived.  From  the  above  relationships  and  stan¬ 
dard  arguments  of  asymptotic  consistency,  it  follows  (Cox  and  Hinkley,  1974,  p.  292)  that 
the  constrained  maximum  likelihood  estimate  0*  is  consistent  and  the  limit  in  probability 
is  0* .  In  addition  the  properties  of  the  unconstrained  maximum  likelihood  translate  to  the 
projection  0*  since  the  Fisher  information  matrix  is  constant. 

Consider  first  the  case  of  estimating  the  model  0  using  the  *-th  order  maximum 
likelihood  estimator  0*.  Then  from  (3-1) 

*(0,0*)  *  D*£{(0‘-0kX«*-®*)r}  +  *(M*)  =  +  *(§, 0*)  (5-1) 

where  the  inequality  follows  from  the  Cramer-Rao  lower  bound 
£I[(0*-0*)r(-D*X®1'®1)]  a  trl*(kiN  =*  K(k)/N  where  K(k)  is  the  number  of  parameters 
estimated  in  the  model  0*  and  N  is  the  sample  size.  The  last  term  in  (5-1)  is  the  bias  in 
using  too  low  an  order  in  the  model  fitting,  and  the  first  term  is  the  sampling  variability 
apart  from  the  bias.  This  bound  K(k)/7N  on  the  variability  is  achieved  for  an  asymptoti¬ 
cally  unbiased  and  efficient  estimator  for  the  class  Ct  such  as  maximum  likelihood.  In 
particular,  if  the  true  model  order  is  no  greater  than  k ,  then 

Lim  N [*(0,0* )  -  *(0,0* )]*j  (5-2) 

The  true  order  k  of  the  process  is  usually  not  known  and  may  in  fact  be  infinite,  and  the 
bias  term  in  (S-l)  is  not  known  so  that  the  above  discussion  is  not  very  useful  in  practice 
although  it  doaa  give  some  insight  into  the  accuracy  issue. 

Consider  now  the  Akaike  MAICE  procedure  using  the  estimator  0A.  Assume  that 
the  true  model  order  is  infinite,  so  that  for  any  /  there  exists  a  j&j  such  that  0;X). 
Define  the  optimal  predictive  order  k'{N)  depending  upon  the  sample  size  N  as  the  order 
k  minimizing  the  negentropy  (5-1).  Then  under  suitable  assumptions,  the  remarkable 
result  is  obtained  by  Shibata  (1981a,  1981b,  1983)  that  asymptotically  as 


(i)  the  lower  bound  using  any  order  selection  scheme  is  for  each  N  given  by  evaluating 
(M)  at  *-*•(*),  and 

(ii)  this  lower  bound  is  achieved  by  the  MAICE  estimator  bA . 

Thus  the  lower  bound  oa  the  aegentropy  which  is  achieved  by  MAICE  is  equal  to  the 

negentropy  that  would  result  from  using  an  efficient  estimator  with  apriori  knowledge  of 

the  optimal  order  k*(N). 

Akaike,  H.,  (1973).  ^Information  theory  and  an  extension  of  the  minimum  likelihood  prin¬ 
ciple.*  In  2nd  International  Symposium  on  Information  Theory.,  Eds.  B.N.  Petrov  and 
F.  Csaki,  pp.  267*281.  Budapest:  Akademiai  Kiado. 

Anderson,  T.W.  (1971).  The  Statistical  Analysis  of  Time  Series.  New  York:  Wiley  and 
Sons. 

Cox,  DJI.  and  Hinkley,  D.V.  (1974).  Theoretical  Statistics.  London:  Chapman  and  Hall 

Larimore,  W£.  (1983).  *Predictive  inference,  sufficiency,  entropy,  and  an  asymptotic 
likelihood  principal*  Biometrika,  Vol  70,  pp.  173-81. 

Larimore,  W£.  (1977).  *Non tested  Tests  on  Model  Stricture.*  Proceedings  Joint  Automatic 
Control  Corf.  (San  Francisco,  CA).  New  York:  IEEE,  pp.  686-690. 

Shibeta,  R.,  (1983).  *A  Theoretical  View  of  the  Use  of  AIC,*  Proc.  of  the  International 
Tima  Series  Meeting,  OX).  Anderson,  Ed. 

Shibeta,  R.,  (1981a).  'An  optimal  autoregresnve  spectral  estimate .*  Ann.  Statist.  Vol  9, 
pp.  3004. 

Shibeta,  R.,  (1981b).  'An  optimal  selection  of  regresson  variables.*  Biometrika,  Vol  68,  pp. 
45-54. 


appendix  b 

ACHIEVABLE  ACCURACY  IN  PARAMETRIC 
ESTIMATION  OF  MULTIVARIATE  SPECTRA 


By  Wallace  E.  Larimore 

Scientific  Systems  Inc.,  Cambridge,  Massachusetts,  U.SA. 

Research  Sponsored  by  the  Air  Force  Office  of  Scientific  Research,  Air  Force  Systems 
Command,  USAF,  under  Contract  Number  F49620-85-C-0086. 


Paper  in  preparation  for  submission. 


1.  INTRODUCTION 

The  problem  of  determining  the  achievable  accuracy  in  identifying  a  model  for  a  station¬ 
ary  multiple  time  series  is  considered  in  this  paper.  The  cases  of  the  presence  or  absense  of  an 
exogeneous  input  or  additive  measurement  noise  are  included.  Consider  the  general  case  where 
x(r)  is  the  exogeneous  input  and  y(t)  is  the  observed  endogeneous  output  of  a  system  which  may 
include  other  unknown  excitations  and  measurment  noise.  Thus  consider  the  jointly  stationary 
gaussian  vector  time  series  x(r)  and  y(/),  r  =...,-1,0,1,...,  with  power  cross-spectral  matricies 
S^Od.O),  S^Od.O),  Sn(<n,9)  parameterized  by  0,  and  denote  the  power  cross-spectral  matrix  of  the 
joint  vector  (xr(r),  yT(t)f  as  5(<n,0)  . 

Statistical  inference  is  considered  on  a  class  of  linear  Gaussian  processes  parameterized  by 
0.  Specifying  a  parametric  model  for  the  conditional  process  y(t),  rii  given  x(t),  t<s  implies  a 
causal  linear  model  of  the  form 

y(0  *  <K0  +  XM'-t;G)*(0  =  <7(0  +  r(t)  (l) 

r» O 

where  h(t;d)  is  a  causal  linear  system  giving  the  response  r(i)  in  y(i)  due  to  the  past  exogeneous 
input  x(r)  and  where  q(t)  is  the  error  in  predicting  y(t)  by  r(i).  From  linear  prediction  theory 
(Gikhman  and  Skorokhod,  1969),  the  transfer  function  of  h(t  ;0)  is  //  (« ;0)  =S^ (u ,Q)S^(u ,0) ,  and 
the  error  q(t)  in  predicting  >(/)  is  uncorrelated  with  r(r)  with  power  spectrum  Sw(w;0)=sSw(<i>,uj 
-  H *(w, 0).  Note  that  any  class  of  parameterized  models  S(<i»,0)  can  be  equivalently 
specified  by  the  parameterized  models  (Sw(u,0),  H  (w,0))  which  will  prove  more  convienent. 

2.  ENTROPY  AND  SPECTRAL  ACCURACY 

Consider  the  following  predictive  inference  setting  (Larimore,  1983)  involving  an  observed 
informative  sample  uT  ={xT  (\)jT  (N),yT  (H))  of  size  S  used  to  estimate  the  process  model, 

and  similarly  consider  a  conceptual  predictive  sample  v  of  size  M  used  to  evaluate  the  accuracy  of 
the  estimated  model.  The  predictive  sample  is  assumed  to  be  identically  distributed  but  indepen¬ 
dent  of  the  informative  sample.  Consider  the  problem  of  inference  on  the  parametric  class 
{p(v,0),0€0}  of  models  with  probability  densities  p(v  ,0)  based  upon  the  informative  sample  u  . 
Consider  the  conceptual  repeated  sampling  experiment  where  on  each  trial  the  samples  u  and  v 
are  each  drawn  independently  from  the  process  5(<i>,0.)  with  0.  assumed  to  be  the  true  parameter 
value.  An  estimative  model  p=p(v  Mu))  is  chosen  for  the  density  of  v  by  some  parameter  estima¬ 
tion  scheme  0(u)  .  The  negative  entropy,  also  known  as  the  expected  Kullback-Leibler  discrimina¬ 
tion  information  or  expected  I-divergence,  is  a  measure  . *  the  error  in  approximating  the  true 
density  p.  of  v  by  the  estimate  p  and  is  given  by 


B-2 


R(p,f)  =  EuK(p,j>)  =  £„  /  p(v,6.)  log~~~  ’  dv  (2) 

where  £,  denote*  expectation  relative  to  u  and  K  denotes  the  Kullback-Leibler  information.  The 
negative  entropy  measure  follows  as  the  natural  measure  in  the  predictive  inference  setting  from 
the  fundamental  principles  of  sufficiency  and  repeated  sampling  (Larimore,  1983).  This 
approach  applies  to  very  general  modeling  methods  such  as  nonparametric,  semi  parametric  or 
parametric  procedures  as  well  as  methods  including  decisions  on  model  structure  or  order  such  as 
those  used  for  AR  and  ARMA  modeling. 

Let  lower  case  variables  denote  a  sample  of  size  M  of  the  predictive  sample,  e  g. 
y=(yT(l)jT(2),...yT(M)?  and  denote  the  covariance  matrix  of  y.  By  expressing  the  density 
p(yjfi)  *  p(y-r  fi)  p(x;0)  in  terms  of  the  conditional  random  process  q(t)  ■  y(t)-r(f), 

p(y*fi)  -  p(y  I  X  ,0)  p(x  ,0)  =  p(y  -  r(x  ,e),8)  p(x  ,6)  -  p{q\ r(x  ,0),0)  p(x  ,0)  (3) 

the  log  likelihood  separates  with  the  density  of  x(t)  in  many  problems  not  a  function  of  the  unk¬ 
nown  parameters  or  at  least  a  function  of  a  separate  set  of  parameters.  A  conditional  viewpoint 
is  taken  in  the  following  where  only  the  conditional  term  p(q\  r(x,8),0)  with  x  considered  as  non- 
random  is  considered.  The  depencence  of  r  on  x  will  be  understood  in  the  notation.  Inclusion  of 
the  second  term  is  tantamount  to  modeling  the  joint  vector  time  series  involving  the  two  series 
x(r)  and  y(t)  jointly  rather  than  as  exogeneous  and  endogeneous  respectively.  The  joint  case  is 
included  as  a  special  case  of  y7 (/)  =  (yT(t)ptT(f))  a  vector  process  with  no  input  x(r)  which  will 
be  discussed  as  a  particular  instance  of  the  model  throughout  the  paper.  The  I-divergence  (2)  con¬ 
ditional  on  x  thus  becomes 

t  p(^|x,0.) 

K(p*j)  =  J  p(q\xfi*)  log— - — 7-  dq  =£logp(y-r.r2  )-£logp(y-r^_) 

p(q\x,9)  w 

”  E\ogp{y-r.2")  -  £logp((y-r.)+<r.-r)^w) 


-  £k>gp(y-r.,2w)  -  £logp((y-r.)i„)  -  E (r.-rf  t^(r.-r)  (4) 

where  t  -r(x,0),  r.  -  r(x  ,0.),  and  where  E  denotes  expectation  with  respect  to  the  density 

p(y I  *,*•)- 

For  brevity  set  S  denote  the  true  spectrum,  and  let  S  denote  an  estimate  of  S.  We  will 
need  to  assume  that  J(< ■>)  is  conunuous  and  that  5w(w)  and  5^ («)  are  positive  definite  for 
w€(-w,ir]  .  In  the  discussion,  the  predictive  sample  v  will  be  considered  to  be  conditional  on  x(i) 
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and  to  have  an  infinite  sample  size  M  This  will  require  the  normalization  of  the  negative 
entropy  and  l-divergence  by  the  sample  size  M  The  I -divergence  per  sample  time  conditional  on 
x(r),  which  will  be  denoted  /(5  J)  and  called  I -divergence  for  brevity,  can  be  expressed  using  (4) 
as  (Kazak cm  and  Papantoni-Kazakoa,  1980) 

/(J  J)  m  lim-^-E(p(vw  ,8(m#)) 

-  -  J  /  Uog\  S"(*)S  „'(*){  -  ,r[l  - 

j  -  H(u>)T)4£  (5) 

where  the  subscript  emphasuet  that  the  sample  of  size  M  of  v  becomes  infinite.  The  negative 
entropy  per  sample,  or  ntgentropy  lor  brevity,  is  defined  as  N(S  J)=iim-^-R(p.j})  -EmI(Sj). 

Note  that  the  I-divergence  is  composed  of  two  terms,  the  last  due  to  the  error  in  estimating  the 
transfer  function  H(m)  and  the  first  due  to  the  error  in  estimating  the  spectrum  5W(«)  of  the 
noise  q(i).  A  useful  approximation  for  the  first  term  in  (5)  is 

-  \  f{‘og I  +  tr\!  - 

-W 

25  (6) 

which  holds  to  second  order  in  the  elements  of  Sn  as  is  easily  shown  by  comparing  first  and 
second  derivatives  of  the  integrands.  This  is  a  generalization  to  the  multivariate  case  of  the 
integral  of  the  squared  relative  error.  Thus  the  I-divergence  is  approximately  a  quadratic  form  in 
the  estimation  errors  of  S^M  and  //  (<*>),  and  these  quadratic  forms  do  not  interact,  i.e.  there 
are  no  crom  terms. 

3.  NORMALIZED  SPECTRAL  ERROR  IN  PRINCIPAL  COMPONENTS 

In  the  muitple  time  series  case,  the  spectral  measure  (5)  has  an  intuitive  interpretation  in 
terms  of  principal  components  of  the  power  spectrum  in  the  frequency  domain.  Principal  com¬ 
ponent  representations  of  the  spectral  matricies  SWM  and  Sa(w)  have  the  form 


/(«)  s„M7‘M  -  DM  ,  L M  S0M  L'M  =  EM 


(7) 


where  J(m)  and  L(m)  given  as  a  function  of  frequency  u>  are  unitary  matrix  transformations  which 
diagonalize  Sa(m)  and  Jw(«)  respectively  so  that  J (u»y*(w)  *  /  =  L(u)L'(<j),  and  where  D(u») 
and  £(«)  are  diagonal  matricies. 

Using  spectral  factorization  theory,  the  matrix  function  /(<■>)  can  be  chosen  as  a  continu¬ 
ous  function  of  w  and  the  transfer  function  of  a  causal  filter  under  either  of  the  mild  assumptions 

(i)  The  process  is  purely  nonde  termini  Stic  ((Gikhman  and  Skorokhod  (1969),  Whittle(1954)  for 
the  scalar  random  field  case). 

(ii)  The  autocovariance  function  is  absolute  summable  (Goodman  and  Ekstrom  (1980)  for  the 
scalar  random  field  case). 

These  derivations  of  the  spectral  factorization  for  the  scalar  case  generalize  to  the  multivariable 
case  with  care  given  to  determining  the  logarithm  of  a  matrix  (Larimore,  1984,  1977).  Otthonor- 
malization  of  the  rows  of  the  spectral  factor  gives  J  (<■>)  while  the  normalizing  terms  form  the  diag¬ 
onal  of  D( w) .  Similarly,  L(u>)  can  be  taken  as  a  spectral  factor  of  a  causal  filter. 

Let  Jf(e)  be  the  random  Fourier  coefficients  of  x(r),  iz.  the  spectral  random  measure  of 
x(r )  .  Filtering  x(r)  with  transfer  function  £.(«)  gives  the  principal  component  process  x(r)  which 
is  expressed  in  the  frequency  domain  as  X(w)  *  L(u)X(u)  ,  and  which  has  the  diagonal  spectral 
matrix  £(<■»)  and  similarly  for  q(t). 

Now  consider  the  asymptotically  equivalent  expression  (6)  for  the  first  term  of  the  spectral 
measure  (5)  which  is  invariant  to  the  unitary  transformation  J(w)  and  is  thus  given  by 

_  1  v  VO«(*0-Pa(»)f  1  \  |0/;(«)[2  du> 

D*(*)  r  ^  2  Zil'DB{m')DjJ{*)  4w 

V />,(«*) -P,(«)  1  *  |Q<y(«)|2  do, 

4  ,L  D„(m)  r4w  2  4w  () 

where  the  approximation  holds  for  the  diagonal  elements  of  D(u)  near  D(u) .  The  approximation 
is  very  useful  when  only  the  estimated  spectrum  («)  is  known  and  we  wish  to  consider  the 
error  in  estimating  the  truth  .  The  first  sum  on  the  right  hand  side  is  the  integrated 

squared  relative  error  of  the  estimated  cospectra  of  the  principal  components,  while  the  second 
term  is  the  integrated  squared  coherency  of  the  estimated  spectrum  O(w)  which  would  be  zero  if 
D(«)-D(a) .  Thus  the  first  term  of  the  measure  (5)  has  a  clear  interpretation  in  the  multivariate 
cue  when  the  true  spectrum  D(u)  is  diagonal  but  where  the  approximating  spectrum  £)(<■»)  is 
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arbitrary  coherency  among  components.  The  general  case  is  reduced  to  this  diagonal  case  by 
choosing  an  appropriate  filter  7(«)  which  diagonalizes  £„(<*>)  as  in  (7). 

The  second  term  in  the  spectral  measure  (5)  is  invariant  to  the  unitary  transformations 
J(m)  and  L(m)  which  gives 

-  i  JiHO'WgW  -  C{.)1E(.)(C(.)  -  C(„)n£ 

-  -7/XIC,/-)-  c„(») |  (9) 

L  -+IJ  CJJ 

where  G  («)  -7  («y/  («)£.’  (w)  is  the  transfer  function  H  (w)  expreamd  in  the  coordinate  frame  of 
the  principal  component  series  x(t)  and  y(t).  The  squared  magnitude  error  |  G(;(«)  -  c/y(«)|  2  in 
the  i  J  element  of  the  transfer  function  is  weighted  by  the  input  signal  to  output  noise  ratio 
Du  E)j  for  the  pair  i  J. 

4.  A  LOWER  BOUND  ON  ACHIEVABLE  SPECTRAL  ACCURACY 

A  lower  bound  on  the  expected  negative  entropy  gives  an  asymptotic  lower  bound  on  the 
achievable  accuracy  in  the  estimation  of  the  process  spectrum  and  transfer  function.  This  applies 
to  the  caae  of  a  fixed  known  model  order  as  well  as  to  the  case  of  an  unknown  or  infinite  model 
order  with  the  use  of  the  AJC  for  model  order  selection.  The  achievable  spectral  accuracy  is 
given  as  a  function  of  the  sample  size  and  the  number  of  parameters  estimated. 

Consider  the  case  where  the  model  S  is  estimated  using  a  K  dimensional  constrained  estima¬ 
tor  9* .  As  derived  in  Larimore  (1986)  using  the  Cramer-Rao  lower  bound,  the  expected  negen- 
tropy  is  asymptotically  bounded  by 

+EUI(SJ)  (10) 

where  S  is  the  model  to  which  the  constrained  maximum  likelihood  estimate  converges  so  that 
£a/(5  J)  is  the  bias  in  the  constrained  model  and  K/2N  is  the  sampling  variability. 

The  bound  K/2N  is  achieved  for  an  asymptotically  unbiased  and  efficient  estimator  such  as 
maximum  likelihood.  In  particular,  this  assumes  that  the  true  model  order  is  no  greater  than  the 
order  K  used  in  the  model  fitting.  The  true  order  IT  of  the  process  is  usually  not  known  and  may 


inf  act  be  infinite,  so  that  the  above  discussion  is  not  very  useful  in  practice  although  it  does  give 
some  insight  into  the  accuracy  issue. 

Consider  now  the  AJtaike  minimum  A1C  procedure  (MA1CE)  using  the  estimator  0A 
(Akaike,  1973).  Assume  that  the  true  model  order  is  infinite,  so  that  for  any  subset  of  the  infinite 
parameter  vector,  there  exist  nonzero  components.  Thus  it  is  not  possible  to  obtain  asymptotically 
unbiased  estimates  of  0  using  a  fixed  model  order  in  estimating  a  model.  Following  Shibata 
(1983,  1981a,  1981b),  define  the  optimal  predictive  order  K'(N)  depending  upoo  the  sample  size 
N  as  the  order  K.  minimizing  the  negentropy  (10).  Then  under  suitable  assumptions,  the  remark¬ 
able  result  is  obtained  by  Shibata  (1981)  that  asymptotically  as  W-® 

(i)  the  lower  bound  using  any  order  selection  scheme  is  given  by  evaluating  (10)  at  ,  and 

(ii)  this  lower  bound  is  achieved  by  the  MA1CE  estimator  &A . 

Thus  the  lower  bound  on  the  negentopy  which  is  achieved  by  MA1CE  is  equal  to  the  negentropy 
that  would  result  from  using  an  efficient  estimator  with  apriori  knowledge  of  the  optimal  order 
K'(N). 

Using'  the  spectral  expression  (5),  an  asymptotic  lower  bound  on  the  expectation  of  the  gen¬ 
eralized  relative  squared  error  in  estimating  the  power  spectrum  is  given  by 

s  i  E. 

+  }  £.  -  «(»))s„(») !«<«•)  -  fiwn^  (11) 

This  gives  a  fundamental  limit  to  the  achievable  accuracy  in  any  parametric  estimation  pro¬ 
cedure.  A  further  perspective  on  this  fact  is  given  by  the  justification  of  the  expected  negentropy 
as  the  natural  measure  of  modeling  approximation  error  in  statistical  inference. 
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